
feat: add is_sm12x_supported() helper for SM12x family detection #2574

Merged
yongwww merged 4 commits into flashinfer-ai:main from blake-snc:feat/add-is-sm12x-supported-helper
Feb 25, 2026
Conversation

@blake-snc
Contributor

@blake-snc blake-snc commented Feb 17, 2026

Summary

Adds is_sm12x_supported() to flashinfer/utils.py as a convenience helper that covers the entire SM12x GPU family (SM120a, SM121a, and future variants like SM122a) without requiring callers to enumerate each minor version.

Uses a major == 12 check, matching the existing pattern of is_sm100a_supported() (major == 10). This means future SM12x variants are automatically covered without code changes.

Motivation: SM121a (DGX Spark) keeps getting missed when only SM120a is checked. This was noted by @eugr in #2560, and PR #2460 is another example where SM121a was not included alongside SM120a.
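The helper described above can be sketched standalone. This is a simplified illustration rather than the actual flashinfer code: the real helper takes a `torch.device` and queries torch, while here the compute capability tuple and CUDA version string are passed in directly and `version_at_least` is reimplemented locally.

```python
def version_at_least(version: str, required: str) -> bool:
    # Numeric comparison of dotted version strings, e.g. "13.0" >= "12.8".
    return tuple(map(int, version.split("."))) >= tuple(map(int, required.split(".")))


def is_sm12x_supported(compute_capability, cuda_version: str) -> bool:
    # Family-wide gate: any SM12x device (SM120a, SM121a, a future SM122a, ...)
    # passes on major == 12, mirroring is_sm100a_supported's major == 10 check.
    major, _minor = compute_capability
    return major == 12 and version_at_least(cuda_version, "12.8")


print(is_sm12x_supported((12, 0), "12.8"))  # SM120a: True
print(is_sm12x_supported((12, 1), "13.0"))  # SM121a (DGX Spark): True
print(is_sm12x_supported((10, 0), "13.0"))  # SM100a: False, not in the 12x family
```

Note the flat CUDA 12.8 floor shown here matches the initial revision; the review comments further down discuss tightening it per minor version.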

Changes

| File | Change |
|------|--------|
| `flashinfer/utils.py` | Add `is_sm12x_supported()` with `major == 12` check |
| `flashinfer/gemm/gemm_base.py` | Replace 3 instances of `is_sm120a_supported(a.device) or is_sm121a_supported(a.device)` |
| `tests/attention/test_fmha_v2_prefill_deepseek.py` | Update skip guard to use `is_sm12x_supported()` |

The individual is_sm120a_supported() and is_sm121a_supported() functions are preserved for cases that need variant-specific behavior.

Validated on DGX Spark (SM121a, CUDA 13.0).

[Second Nature Computing](https://joinsecondnature.com)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Summary by CodeRabbit

  • Refactor

    • Consolidated separate SM120/SM121 capability checks into a unified SM12x check and updated the public import surface accordingly.
    • Introduced explicit CUDA-version gating for SM12x variants and clarified related compatibility/error messages.
  • Tests

    • Updated GPU compatibility tests and skip logic/messages to target SM12x architecture support.

@gemini-code-assist
Contributor

Summary of Changes

Hello @blake-snc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new utility function to simplify the detection of NVIDIA GPUs belonging to the SM12x family. By checking only the major compute capability version, the change ensures that all current and future SM12x variants are automatically recognized, preventing issues where specific minor versions might be inadvertently excluded. This enhancement improves the robustness and maintainability of GPU compatibility checks across the codebase.

Highlights

  • New Helper Function: Added is_sm12x_supported() to flashinfer/utils.py to provide a convenient way to detect any GPU within the SM12x family (e.g., SM120a, SM121a, and future variants) by checking the major compute capability version.
  • Simplified GPU Detection Logic: Replaced explicit checks for is_sm120a_supported() or is_sm121a_supported() with the new is_sm12x_supported() helper in flashinfer/gemm/gemm_base.py, streamlining the code and ensuring future SM12x variants are automatically covered.
  • Test Coverage Update: Updated the skip guard in tests/attention/test_fmha_v2_prefill_deepseek.py to utilize the new is_sm12x_supported() function, ensuring tests correctly reflect support for the entire SM12x family.


Changelog
  • flashinfer/gemm/gemm_base.py
    • Imported is_sm12x_supported from flashinfer.utils.
    • Replaced is_sm120a_supported(a.device) or is_sm121a_supported(a.device) with is_sm12x_supported(a.device) in gemm_fp8_nt_groupwise.
    • Replaced is_sm120a_supported(a.device) or is_sm121a_supported(a.device) with is_sm12x_supported(a.device) in _check_group_gemm_fp8_nt_groupwise_problem_size.
    • Replaced is_sm120a_supported(a.device) or is_sm121a_supported(a.device) with is_sm12x_supported(a.device) in group_gemm_fp8_nt_groupwise.
  • flashinfer/utils.py
    • Added is_sm12x_supported function, which checks if the device's major compute capability is 12 and CUDA version is at least 12.8.
  • tests/attention/test_fmha_v2_prefill_deepseek.py
    • Updated import statement to use is_sm12x_supported instead of is_sm120a_supported.
    • Modified the pytest.skip condition to check is_sm12x_supported and updated the corresponding skip message.

@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaces separate SM120a/SM121a capability checks with a unified is_sm12x_supported(device) and updates call sites across utils, GEMM, prefill, and tests; enforces CUDA minimums per SM12x minor (12.8 for .0, 13.0 for .1+).

Changes

- **Utils** (`flashinfer/utils.py`): Added `is_sm12x_supported(device)` using `get_compute_capability` and CUDA-minimum logic (12.8 for SM120a, 13.0 for SM121a+); bumped the `is_sm121a_supported` gate.
- **GEMM module** (`flashinfer/gemm/gemm_base.py`): Public import surface updated to expose `is_sm12x_supported`; replaced prior `is_sm120a_supported`/`is_sm121a_supported` checks with `is_sm12x_supported(device)` in FP8/groupwise GEMM paths.
- **Prefill / FMHA** (`flashinfer/prefill.py`): Replaced SM120/121 checks with `is_sm12x_supported(device)` for FMHA module selection and capability checks; added a `get_compute_capability` import and improved CUDA-version-specific error messaging for SM12x GPUs.
- **Tests** (`tests/attention/test_fmha_v2_prefill_deepseek.py`): Updated test skip logic and imports to use `is_sm12x_supported(torch.device("cuda"))` instead of separate SM120/121 checks.
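The net effect at the call sites can be illustrated with stubbed checks (hypothetical stubs for illustration; the real helpers take a `torch.device` and also gate on CUDA version, which is omitted here for brevity):

```python
# Stubbed capability checks, keyed on a (major, minor) compute capability.
def is_sm120a_supported(cc): return cc == (12, 0)
def is_sm121a_supported(cc): return cc == (12, 1)
def is_sm12x_supported(cc): return cc[0] == 12

for cc in [(12, 0), (12, 1), (12, 2)]:
    enumerated = is_sm120a_supported(cc) or is_sm121a_supported(cc)
    family = is_sm12x_supported(cc)
    print(cc, enumerated, family)
# A hypothetical future SM122a, cc == (12, 2), fails the enumerated check
# but is covered by the family-wide check with no code change.
```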

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • yzh119
  • cyx-6
  • bkryu
  • nvmbreughe
  • jimmyzho

Poem

🐰 I hopped through checks that once were two,
Now SM12x sings a single view.
CUDA gates set, the branches clear,
One tiny hop, the path draws near. ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45%, below the required 80.00% threshold. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description includes a clear Summary section, a detailed Changes table, and validation details, but does not follow the repository's pull request template (Description, Related Issues, Pre-commit Checks, Tests). | Restructure the description to follow the repository's pull request template, including the required sections and checkboxes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding an `is_sm12x_supported()` helper for SM12x GPU family detection. |


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new helper function is_sm12x_supported to simplify GPU architecture checks, which is a good improvement for maintainability. The changes apply this new helper in gemm_base.py and a test file.

I have two main concerns:

  1. The implementation of is_sm12x_supported in flashinfer/utils.py seems to have a too-permissive CUDA version requirement for SM121a and future variants, which is inconsistent with existing checks and could lead to runtime errors.
  2. The test test_fmha_v2_prefill_deepseek was updated to use the new helper, but the function it tests, fmha_v2_prefill_deepseek, was not updated. It still contains a hardcoded check for SM120a, which will cause the test to fail on other SM12x devices.

Please see my detailed comments for suggestions on how to address these issues.

Comment on lines +60 to +61:

```python
if not is_sm12x_supported(torch.device("cuda")):
    pytest.skip("fmha_v2_prefill_deepseek is only supported on SM12x GPUs.")
```
Contributor


critical

While you've updated the test to run on SM12x devices, the fmha_v2_prefill_deepseek function itself in flashinfer/prefill.py still has a check for is_sm120a_supported. This will cause the test to fail with a ValueError on non-SM120a devices like SM121a.

Please update the check inside fmha_v2_prefill_deepseek to use is_sm12x_supported as well.

```python
    pattern used by ``is_sm100a_supported`` (``major == 10``).
    """
    major, _ = get_compute_capability(device)
    return major == 12 and version_at_least(torch.version.cuda, "12.8")
```
Contributor


high

This check uses CUDA 12.8 as the minimum version for the entire SM12x family. However, is_sm121a_supported requires CUDA 12.9. This creates an inconsistency and could enable code paths for SM121a on an unsupported CUDA version, potentially leading to runtime errors.

If SM121a features used in flashinfer indeed work on CUDA 12.8, it would be best to also update is_sm121a_supported to use 12.8 for consistency. If not, this check is too permissive and should be stricter, for instance by checking the minor version and applying the appropriate CUDA version requirement.
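A minor-version-aware variant along the lines this comment suggests could look like the following sketch (simplified signatures, not the actual flashinfer code; the 12.8/13.0 floors match the values the later commits in this PR settled on):

```python
def version_at_least(version: str, required: str) -> bool:
    # Numeric comparison of dotted version strings, e.g. "12.9" < "13.0".
    return tuple(map(int, version.split("."))) >= tuple(map(int, required.split(".")))


def is_sm12x_supported(compute_capability, cuda_version: str) -> bool:
    # Apply a per-minor CUDA floor: SM120a needs CUDA >= 12.8,
    # SM121a and later minors need CUDA >= 13.0.
    major, minor = compute_capability
    if major != 12:
        return False
    min_cuda = "12.8" if minor == 0 else "13.0"
    return version_at_least(cuda_version, min_cuda)
```

With this shape, SM121a on CUDA 12.9 is rejected instead of slipping through a flat 12.8 gate.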

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/utils.py`:
- Around line 567-575: The SM12x support check is using CUDA 12.8 which is too
low for SM121a; update is_sm12x_supported to require CUDA >= 13.0 by changing
the version_at_least check to "13.0" and also update the companion
is_sm121a_supported helper (which currently checks "12.9") to require "13.0" so
SM121a systems won't be incorrectly treated as supported on CUDA 12.8–12.9;
modify the version_at_least arguments in the functions is_sm12x_supported and
is_sm121a_supported accordingly.

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/prefill.py (1)

102-119: ⚠️ Potential issue | 🟡 Minor

Stale error message in the else branch.

Line 119 still says "SM100A is not supported on this device", but the condition now gates on three architectures (SM100a, SM110a, SM12x). A caller running on, e.g., an SM89 GPU will get a misleading error implying the issue is specifically SM100A.

🛠️ Suggested fix
```diff
-    raise ValueError("SM100A is not supported on this device")
+    raise ValueError(
+        "This device is not supported; requires SM100a, SM110a, or SM12x (SM120a/SM121a)."
+    )
```
🧹 Nitpick comments (1)
flashinfer/prefill.py (1)

3822-3823: Core fix is correct; optional: inline the exception message per Ruff TRY003.

Replacing is_sm120a_supported with is_sm12x_supported is the right fix — SM121a (DGX Spark) callers no longer hit a spurious ValueError. The Ruff TRY003 hint about long inline messages is a style nit; the rest of the codebase follows the same pattern, so deferring is fine.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/prefill.py`:
- Around line 3824-3825: The error raised in fmha_v2_prefill_deepseek when
is_sm12x_supported(query.device) is False uses a misleading hardware-only
message; update the ValueError text to mention the minimum CUDA version
requirement (e.g., "requires CUDA >= 13.0 and SM12x support") so users on newer
SM12x chips but older CUDA (like SM121a + CUDA 12.9) get the correct diagnosis;
mirror the clearer message used in get_fmha_module and ensure references to
is_sm12x_supported and fmha_v2_prefill_deepseek remain intact.
- Around line 119-121: The ValueError raised in prefill.py currently gives a
generic message that hides whether SM12x was rejected due to hardware major
version or due to insufficient CUDA; update the check around is_sm12x_supported
(or split its logic into two checks, e.g., is_sm12x_hardware and
is_sm12x_cuda_ok) and raise a clearer error that distinguishes the two cases —
for hardware mismatch say it requires SM100a/SM110a/SM12x, and for CUDA-version
mismatch include the detected CUDA version and the required minimum (e.g.,
"SM120x requires CUDA >= 12.8; SM121x requires CUDA >= 13.0; found X.Y"), so
callers of the code can tell if they need new hardware vs. a CUDA upgrade.

blake-snc and others added 4 commits February 19, 2026 17:09
Add a major-version-based helper that covers all SM12x GPUs (SM120a,
SM121a, and future variants) so callers don't need to enumerate each
minor version individually. Uses major == 12 check, matching the
pattern of is_sm100a_supported (major == 10).

Update existing call sites in gemm_base.py and the DeepSeek MLA test.

This avoids the recurring pattern where SM121a support gets missed when
only SM120a is checked, as noted in PR flashinfer-ai#2460 and flashinfer-ai#2560 discussion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix CUDA version check: use minor-version-aware logic so SM121a
  requires CUDA 13.0 (per NVIDIA DGX Spark Porting Guide) while
  SM120a requires CUDA 12.8
- Update is_sm121a_supported from CUDA 12.9 to 13.0 for consistency
- Update fmha_v2_prefill_deepseek in prefill.py to use is_sm12x_supported
  instead of is_sm120a_supported (fixes SM121a ValueError)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The ValueError text said "SM100A is not supported" but the conditional
now gates on three architecture families.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ismatch

When is_sm12x_supported returns False on actual SM12x hardware due to
insufficient CUDA version, the error now reports the specific CUDA
requirement (12.8 for SM120x, 13.0 for SM121x) instead of a generic
"not supported" message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@blake-snc force-pushed the feat/add-is-sm12x-supported-helper branch from 5489c61 to 752fc7a on Feb 20, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/gemm/gemm_base.py (2)

3769-3795: ⚠️ Potential issue | 🟡 Minor

Misleading error message when CUDA version is the actual failure reason.

is_sm12x_supported now embeds a CUDA-version gate (12.8 for SM120, 13.0 for SM121+). If a user runs on genuine SM12x hardware but with an insufficient CUDA runtime, is_sm12x_supported returns False and the else branch fires with "Unsupported device for FP8 GEMM: {a.device}" — blaming the device instead of the CUDA version.

🛡️ Suggested fix
```diff
+    elif get_compute_capability(a.device)[0] == 12:
+        from ..utils import version_at_least
+        import torch as _torch
+        min_cuda = "13.0" if get_compute_capability(a.device)[1] >= 1 else "12.8"
+        raise ValueError(
+            f"FP8 GEMM on SM12x requires CUDA >= {min_cuda}, "
+            f"but found CUDA {_torch.version.cuda}."
+        )
     else:
         raise ValueError(f"Unsupported device for FP8 GEMM: {a.device}")
```

Alternatively, is_sm12x_supported could be split into a device-check and a CUDA-check so each path can produce a precise diagnostic.


4130-4162: ⚠️ Potential issue | 🟠 Major

Missing else silently returns an uninitialized output tensor.

group_gemm_fp8_nt_groupwise has no else branch after the is_sm12x_supported / is_sm100a_supported chain. If neither condition holds — e.g. an SM12x device whose CUDA runtime is below the minimum, so is_sm12x_supported returns False, and is_sm100a_supported is also False — the function falls through and returns the torch.empty(...) tensor silently, producing garbage output with no diagnostic.

Compare gemm_fp8_nt_groupwise (line 3770–3795) which correctly has else: raise ValueError(...).

Since this PR changed line 4130 to use is_sm12x_supported (which now includes a CUDA-version gate), the set of inputs that reach the silent fall-through path has grown relative to the previous implementation.

🐛 Proposed fix
```diff
     elif is_sm100a_supported(a.device):
         get_gemm_sm100_module().group_gemm_fp8_nt_groupwise(
             int_workspace_buffer,
             float_workspace_buffer,
             a,
             b,
             a_scale,
             b_scale,
             out,
             m_indptr,
             n,
             k,
             *scale_granularity_mnk,
             scale_major_mode,
             mma_sm,
         )
+    else:
+        raise ValueError(f"Unsupported device for group FP8 GEMM: {a.device}")
     return out
```
🧹 Nitpick comments (2)
flashinfer/gemm/gemm_base.py (1)

4031-4037: SM12x num_groups > 1 validation is silently skipped when CUDA version is insufficient.

is_sm12x_supported(a.device) returns False on any SM12x device whose CUDA runtime is below the minimum (12.8 for SM120, 13.0 for SM121). In that scenario the correctness guard — "group_gemm_fp8_nt_groupwise has correctness issues for num_groups > 1 on SM120/121" — is never raised, even though the architectural restriction still applies. The guard should trigger based on the device's compute-capability family, not on whether the CUDA version is also sufficient.

♻️ Proposed fix
```diff
-    if is_sm12x_supported(a.device):
+    if get_compute_capability(a.device)[0] == 12:
         if num_groups > 1:
             raise RuntimeError(
                 "group_gemm_fp8_nt_groupwise has correctness issues for num_groups > 1 on SM120/121"
             )
```

get_compute_capability is already imported in this file (line 87).

flashinfer/prefill.py (1)

103-129: LGTM — SM12x branching and error disambiguation are correct.

The condition (Lines 103–107) correctly unifies SM100a/SM110a/SM12x. The else-branch (Lines 119–129) properly distinguishes "SM12x hardware with old CUDA" from "unsupported hardware", addressing the earlier concern about misleading error messages.

The Ruff TRY003 warning (inline long error messages) applies to Lines 123–126 and 127–129. Extracting these into a custom exception class is an optional style improvement for this codebase's conventions.
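The disambiguation this comment describes — separating "right hardware, CUDA too old" from "unsupported hardware" — can be sketched as follows (hypothetical function name and simplified inputs; the real code in `flashinfer/prefill.py` queries the device through torch):

```python
def check_fmha_device(compute_capability, cuda_version: str) -> None:
    # Accept SM100a/SM110a outright and SM12x when the CUDA toolkit is new
    # enough; otherwise raise an error that names the actual failure cause.
    major, minor = compute_capability
    if major in (10, 11):
        return
    if major == 12:
        min_cuda = "12.8" if minor == 0 else "13.0"
        new_enough = tuple(map(int, cuda_version.split("."))) >= tuple(
            map(int, min_cuda.split("."))
        )
        if new_enough:
            return
        raise ValueError(
            f"SM12{minor} requires CUDA >= {min_cuda}, but found CUDA {cuda_version}."
        )
    raise ValueError("This device is not supported; requires SM100a, SM110a, or SM12x.")
```

On SM121a with CUDA 12.9, this reports the CUDA shortfall rather than blaming the device.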


@yzh119 yzh119 added the run-ci label Feb 23, 2026
@yzh119
Collaborator

yzh119 commented Feb 23, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !340 has been created, and the CI pipeline #44590176 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #44590176: 14/20 passed

@blake-snc
Contributor Author

@yzh119 when you can, let me know what failed so I can get a fix going!

@yongwww
Member

yongwww commented Feb 25, 2026

[FAILED] Pipeline #44590176: 14/20 passed

The CI results are good to go. The failures are due to timeouts; the main branch had the same timeout failures.

@yongwww yongwww merged commit 17770f5 into flashinfer-ai:main Feb 25, 2026
45 of 52 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 27, 2026
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
…shinfer-ai#2574)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>