fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels #2725
Conversation
Summary of Changes (Gemini Code Assist): This pull request addresses a compatibility issue preventing NVFP4 MoE grouped GEMM kernels from running on newer SM120 (RTX Blackwell desktop) GPUs. By relaxing the compute-capability checks from a strict equality to a greater-than-or-equal comparison, and explicitly adding SM120 to the list of supported architectures for compilation, the changes extend the kernels to this newer hardware. This ensures that users with Blackwell GPUs can leverage the optimized MoE kernels.
Activity
Note: Reviews paused. This branch is under active development, so CodeRabbit automatically paused this review to avoid an influx of comments on new commits.
📝 Walkthrough: Relaxed the TRT-LLM device SM checks to accept SM 10.x or SM 12.x; expanded the fused_moe JIT generators to include CUDA major version 12; added CompilationContext._normalize_cuda_arch to normalize CUDA compute-capability suffixes for nvcc -gencode flags.
Sequence Diagram(s): skipped (the changes are gating, flag generation, and normalization logic without a new multi-component sequential flow).

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Code Review
This pull request adds support for SM120 (RTX Blackwell desktop) GPUs to the NVFP4 MoE kernels. The changes correctly relax the hardcoded SM100-only checks to allow for newer architectures by changing an equality check to a greater-than-or-equal check in trtllm_fused_moe_kernel_launcher.cu and adding major version 12 to the supported versions in fused_moe.py. My review includes suggestions to improve a misleading error message and remove a redundant device capability check in the C++ code for better maintainability.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@csrc/trtllm_fused_moe_kernel_launcher.cu`:
- Around line 417-418: Replace the >= check using TVM_FFI_ICHECK_GE(major, 10)
with an explicit allowlist check against supported SM major versions (inspect
the existing validated targets and check major against that set, e.g., major ==
X || major == Y) and update the error message to print the full SM as
"<major>.<minor>" and list the supported majors (e.g., "unsupported SM
<major>.<minor>; supported SM majors: ..."); apply the same change to the other
occurrence around lines 1345-1347 so both checks use the explicit allowlist and
the new descriptive error text.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 68ea16c3-55df-4006-9354-8bee77ad27c9
📒 Files selected for processing (2)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/jit/fused_moe.py
SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

Changes:
- jit/fused_moe.py: Add major version 12 to supported_major_versions
- csrc/trtllm_fused_moe_kernel_launcher.cu: ICHECK_EQ -> ICHECK_GE for the SM check

Tested: Qwen3.5-397B-A17B-NVFP4 on 4x RTX PRO 6000 (SM120, 96GB each)
- compute_120f + CUDA 13.0: 39.0 tok/s (correct output)
- compute_120a + CUDA 12.8: 14.6 tok/s (correct, slow fallback tactics)

Related: CUTLASS flashinfer-ai#2820, flashinfer-ai#2800; vLLM #33416, #33333

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Force-pushed 35c37db to 15322e6 (Compare)
…compute_120f

- csrc/trtllm_fused_moe_kernel_launcher.cu: Changed ICHECK_GE to an explicit allowlist (major == 10 || major == 12) per reviewer request. Updated error messages to state the 'SM 10.x or SM 12.x' requirement.
- flashinfer/compilation_context.py: SM12.x GPUs now generate 'compute_120f' flags instead of 'compute_120a'. The 'f' suffix enables fast TMA WS grouped GEMM tactics on desktop Blackwell (39 tok/s vs 14.6 tok/s with the 'a' suffix). SM9.x retains 'a', SM10.x+ uses 'f'.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 44-47: The manual FLASHINFER_CUDA_ARCH_LIST parsing must apply the
same suffix mapping as the autodetection branch so Blackwell (major>=10) becomes
minor+"f" and CUDA 9 becomes minor+"a"; update the code that reads and splits
FLASHINFER_CUDA_ARCH_LIST (the logic that produces major and minor values used
by get_nvcc_flags_list()) to normalize each entry by converting numeric minors
to strings and appending "a" when major==9 or "f" when major>=10, mirroring the
existing autodetect block that modifies minor, so get_nvcc_flags_list() emits
the corrected compute targets.
- Around line 44-47: The code is appending the "f" suffix to SM10+ targets
unconditionally; update CompilationContext (either in __init__ where suffixes
are constructed or in get_nvcc_flags_list()) to filter out "*f" targets unless
the toolchain supports them by calling jit.cpp_ext.is_cuda_version_at_least with
the same CUDA-version gating logic used in aot.detect_sm_capabilities();
specifically, import is_cuda_version_at_least, determine the minimum CUDA
version required for each "*f" target (reuse the mapping/logic from
aot.detect_sm_capabilities()), and only append or emit targets like
"compute_103f" when is_cuda_version_at_least(required_major, required_minor)
returns true.
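Taken together, the two findings above suggest env-var parsing along these lines (a hypothetical sketch, not FlashInfer's actual code: the entry format, the `cuda_at_least_13` flag, and the helper name `parse_arch_list` are assumptions for illustration):

```python
def parse_arch_list(env_value, cuda_at_least_13=True):
    """Sketch: parse FLASHINFER_CUDA_ARCH_LIST entries ('major.minor'
    with an optional 'a'/'f' suffix) and apply the same suffix
    normalization as the autodetection path."""
    archs = []
    for entry in env_value.replace(",", " ").split():
        if entry[-1] in ("a", "f"):
            # Respect a user-provided suffix such as '12.0f' as-is.
            major_s, minor_s = entry[:-1].split(".")
            archs.append((int(major_s), minor_s + entry[-1]))
            continue
        major, minor = (int(x) for x in entry.split("."))
        if major == 9:
            archs.append((major, str(minor) + "a"))
        elif major >= 10:
            # Per the review: Blackwell gets 'f' only when the toolchain supports it.
            archs.append((major, str(minor) + ("f" if cuda_at_least_13 else "a")))
        else:
            archs.append((major, str(minor)))
    return archs
```

A user-supplied entry such as '12.0f' is kept verbatim, while bare entries get the same suffix policy as autodetection.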
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7027ec4f-c30e-47d1-94d4-91a398c68396
📒 Files selected for processing (2)
- csrc/trtllm_fused_moe_kernel_launcher.cu
- flashinfer/compilation_context.py
🚧 Files skipped from review as they are similar to previous changes (1)
- csrc/trtllm_fused_moe_kernel_launcher.cu
Refactor compilation_context.py per reviewer feedback:

- Extract a _normalize_cuda_arch() static method that both the FLASHINFER_CUDA_ARCH_LIST manual parsing and the auto-detection path route through for consistent suffix selection.
- Add CUDA version gating: only append the 'f' suffix to SM10+ targets when CUDA >= 13.0 is available (via is_cuda_version_at_least from flashinfer.jit.cpp_ext). Falls back to the 'a' suffix on older CUDA toolchains or when the import is unavailable.
- Respect user-provided suffixes in FLASHINFER_CUDA_ARCH_LIST (e.g. '12.0f') without re-normalizing.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
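The suffix policy this commit describes can be sketched as follows (a simplified illustration only; the real `_normalize_cuda_arch` probes the toolchain via `is_cuda_version_at_least`, whereas here the CUDA check is passed in as a boolean):

```python
def normalize_cuda_arch(major, minor, cuda_at_least_13=False):
    """Sketch: SM 9.x gets the 'a' suffix; SM 10+ gets 'f' only when
    the toolchain is CUDA >= 13.0, otherwise falls back to 'a';
    older SMs get no suffix."""
    if major == 9:
        return (major, str(minor) + "a")
    if major >= 10:
        return (major, str(minor) + ("f" if cuda_at_least_13 else "a"))
    return (major, str(minor))
```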
Does that mean that after merging this PR, SM120 (NVIDIA RTX 6000 Pro Blackwell) can also run the MoE kernel from FlashInfer for the GPT-OSS-20B model?
```diff
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
- TVM_FFI_ICHECK_EQ(major, 10) << "MoE kernel requires 10.x architecture. Current device has SM "
+ TVM_FFI_ICHECK(major == 10 || major == 12) << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM "
```
Maybe 11 (Thor) also works? It should be very similar to 10.x.
Possibly. SM120 and Blackwell cards are still not nearly as well supported as they should be; I wouldn't say for sure.
The problem is that vLLM currently cannot use FlashAttention for SM120; it uses Triton attention instead. Will this PR fix that?
/bot run |
Constrain the compute_*f suffix to SM 12.x (RTX Blackwell desktop) specifically, rather than applying it to all major >= 10. This avoids possible performance regressions on SM 10.x (B200) where compute_100a is the validated target. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
brandonmmusic-max left a comment:
Thanks — that makes sense. I restricted the f suffix normalization to SM120 specifically to avoid possible regressions on SM100/B200. The PR has been updated.
The device capability check in FusedMoeFP4Launcher::init() is redundant — init_common() already performs the same validation. Remove the duplicated check and the static device_props lambda. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 47-56: The try/except around "from flashinfer.jit.cpp_ext import
is_cuda_version_at_least" should catch RuntimeError and ValueError (or a general
Exception) in addition to ImportError so failures from is_cuda_version_at_least
don't propagate; update the logger.debug message in compilation_context.py to
say something like "Could not determine CUDA version; falling back to 'a' suffix
for SM %d.%d" and include the caught exception details, then return (major,
str(minor) + "a") as the fallback.
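The exception-tolerant fallback requested above can be sketched like this (hypothetical: the `version_probe` callable stands in for `is_cuda_version_at_least`, which may raise `RuntimeError` or `ValueError`):

```python
def select_sm12_suffix(version_probe):
    """Sketch: return '0f' for SM 12.x when the CUDA toolchain reports
    >= 13.0, falling back to '0a' when the probe fails or says no."""
    try:
        if version_probe("13.0"):
            return "0f"
    except (ImportError, RuntimeError, ValueError):
        # nvcc unavailable or malformed version string: degrade gracefully
        pass
    return "0a"
```

In the real code the except branch would also log the failure at debug level before falling back.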
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf765eee-2187-4f4d-8235-ac00013ccf8b
📒 Files selected for processing (1)
flashinfer/compilation_context.py
is_cuda_version_at_least can raise RuntimeError (nvcc unavailable) or ValueError (malformed version string), not just ImportError. Catch all three to ensure graceful fallback to the 'a' suffix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DGX Spark (SM 12.1) should also compile as compute_120f, not 121f. All SM 12.x variants are now normalized to SM 120 for compatibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi Ka-Hyun,
I have made the requested changes and pushed them. I also updated the
docstring to reflect these updates, as I thought that might be helpful for
future reference.
Best regards,
Brandon M. Music
…
On Fri, Mar 13, 2026 at 5:12 PM Ka-Hyun Nam ***@***.***>
wrote:
> ***@***.**** requested changes on this pull request.
>
> One last comment, thanks!
> ------------------------------
>
> In flashinfer/compilation_context.py
> <#2725 (comment)>
> :
>
> > @@ -30,20 +30,55 @@ class CompilationContext:
> "-DFLASHINFER_ENABLE_FP4_E2M1",
> ]
>
> + @staticmethod
> + def _normalize_cuda_arch(major: int, minor: int) -> tuple[int, str]:
> + """Normalize a (major, minor) capability pair into a (major, minor_str)
> + tuple with the correct architecture suffix for nvcc.
> +
> + SM 9.x -> 'a' suffix (e.g. compute_90a)
> + SM 12.x -> 'f' suffix (e.g. compute_120f) when the installed CUDA
> + toolchain supports it (CUDA >= 13.0), otherwise 'a'.
> + SM 10+ -> 'a' suffix (e.g. compute_100a)
> + SM < 9 -> no suffix
> + """
> + if major == 9:
> + return (major, str(minor) + "a")
> + elif major == 12:
>
> One last nit after discussing with @nv-yunzheq
> <https://github.com/nv-yunzheq> ; I think we also want to compile DGX
> spark (sm121) to sm120f, instead of 121f.
>
> Can we change the logic to return always major + 0f (or just 120f) on
> line 50?
>
/bot run |
[SUCCESS] Pipeline #46362901: 13/20 passed
Hi, the pre-commit check has failed; could you take a look? The steps for pre-commit are in the PR default template (✅ Pre-commit Checks).
Hi there, pre-commit hooks pass locally — all 14 hooks clean (clang-format v19.1.1, ruff, mypy). I pushed an empty commit to retrigger CI. The 7 test failures in the earlier run appear to be CI runner timeouts (A10G/T4 all fail at exactly 1m47s) rather than code issues; the H100 JIT unittest passed successfully after 4+ hours. If there's anything else I can do, please let me know!
Hi, there is a specific check named "pre-commit" which is marked "Required" in the checklist at the bottom of the page. This blocks the merge because it is required. Specifically, it shows:

```diff
diff --git a/csrc/trtllm_fused_moe_kernel_launcher.cu b/csrc/trtllm_fused_moe_kernel_launcher.cu
index b1af6ee..e6a5cf2 100644
--- a/csrc/trtllm_fused_moe_kernel_launcher.cu
+++ b/csrc/trtllm_fused_moe_kernel_launcher.cu
@@ -414,8 +414,9 @@ void FusedMoeLauncher::init_common(
   int major = 0, minor = 0;
   cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
   cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
-  TVM_FFI_ICHECK(major == 10 || major == 12) << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM "
-                                             << major << minor;
+  TVM_FFI_ICHECK(major == 10 || major == 12)
+      << "MoE kernel requires SM 10.x or SM 12.x architecture. Current device has SM " << major
+      << minor;
   this->device_version = std::make_tuple(major, minor);

   args->routing_logits = routing_logits.has_value() ? routing_logits.value().data_ptr() : nullptr;
diff --git a/flashinfer/compilation_context.py b/flashinfer/compilation_context.py
index 14b7863..c6072fb 100644
--- a/flashinfer/compilation_context.py
+++ b/flashinfer/compilation_context.py
@@ -47,12 +47,15 @@ class CompilationContext:
         elif major == 12:
             try:
                 from flashinfer.jit.cpp_ext import is_cuda_version_at_least
+
                 if is_cuda_version_at_least("13.0"):
                     return (major, "0f")
             except (ImportError, RuntimeError, ValueError):
                 logger.debug(
                     "Could not determine CUDA version; "
-                    "falling back to 'a' suffix for SM %d.%d", major, minor
+                    "falling back to 'a' suffix for SM %d.%d",
+                    major,
+                    minor,
                 )
             return (major, "0a")
         elif major >= 10:
@@ -77,9 +80,7 @@
         try:
             for device in range(torch.cuda.device_count()):
                 major, minor = torch.cuda.get_device_capability(device)
-                self.TARGET_CUDA_ARCHS.add(
-                    self._normalize_cuda_arch(major, minor)
-                )
+                self.TARGET_CUDA_ARCHS.add(self._normalize_cuda_arch(major, minor))
         except Exception as e:
             logger.warning(f"Failed to get device capability: {e}.")
```
- Break long TVM_FFI_ICHECK line per clang-format v19.1.1
- Add blank line after import per ruff format
- Reformat logger.debug args to one-per-line
- Inline single _normalize_cuda_arch call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Head branch was pushed to by a user without write access
looks all good! auto-merge enabled thx for the contrib! |
Hi @aleozlx — thanks for the heads-up on the pre-commit check. I've run the hooks again, and the pre-commit check is now passing (green). All 14 hooks pass locally as well. Regarding the other CI failures: the A10G and T4 JIT unittest jobs all time out at exactly 1m47s, which appears to be a runner provisioning issue rather than a code problem. The H100 JIT unittest passed successfully after a full 4+ hour run, confirming the code works correctly on supported hardware. Please let me know if there's anything else needed. Thank you for reviewing!
<!-- .github/pull_request_template.md -->

## 📌 Description

Bug found in nightly [Spark, 12.9] matrix https://gitlab-master.nvidia.com/dl/flashinfer/flashinfer-ci/-/jobs/285092631, where Spark compiles to "120a" (see the "/tmp/.cache/flashinfer/0.6.6/120a/" path in the log below).

```
E RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
/tmp/.cache/flashinfer/0.6.6/120a/generated/single_decode_with_kv_cache_dtype_q_f16_dtype_kv_f16_dtype_o_f16_head_dim_qk_128_head_dim_vo_128_posenc_2_use_swa_False_use_logits_cap_False/single_decode.cu:100: RuntimeError: Check failed: (status == cudaSuccess) is false: SingleDecodeWithKVCache kernel launch failed, error: no kernel image is available for execution on the device
```

Root cause was #2725, where we added logic for compiling both Spark and Thor to 120f, but on the condition that the CUDA version is 13 or higher. Lower versions (12.9) default to the 'a' suffix, 120a.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Bug Fixes**
  * Strengthened CUDA validation for SM 12.x GPUs: now requires CUDA 12.9 or newer and emits a clear error if unmet, replacing the previous silent fallback behavior.
Summary
SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.
Changes:
- jit/fused_moe.py: Add major version 12 to `supported_major_versions`
- csrc/trtllm_fused_moe_kernel_launcher.cu: `ICHECK_EQ(major, 10)` -> `ICHECK_GE(major, 10)`

Benchmark (Qwen3.5-397B on 4x RTX PRO 6000 SM120):
Root cause: All TMA WS grouped GEMM autotuner tactics fail on `compute_120a`, requiring `compute_120f` (CUDA 13.0). CuTe DSL `admissible_archs` in vendored CUTLASS also needs `sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py, etc.).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577
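For context, the normalized (major, minor-with-suffix) pairs discussed in this PR turn into nvcc `-gencode` flags shaped roughly like this (a sketch; the exact flag builder in `compilation_context.py` may differ):

```python
def gencode_flags(archs):
    """Sketch: map normalized pairs such as (12, "0f") to nvcc flags
    like -gencode=arch=compute_120f,code=sm_120f."""
    flags = []
    for major, minor_str in archs:
        arch = f"{major}{minor_str}"  # e.g. (12, "0f") -> "120f"
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    return flags
```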
Summary by CodeRabbit