
Remove compute capability restrictions from routerGemm and fused_topk_deepseek#2576

Open
yzh119 wants to merge 1 commit into main from remove-cc-restriction-router-gemm-topk

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Feb 18, 2026

Summary

  • Remove @supported_compute_capability and @backend_requirement decorators from routerGemm (mm_M1_16_K7168_N128, mm_M1_16_K7168_N256) and fused_topk_deepseek APIs
  • Both kernels use standard CUDA operations with SM90+ PDL features guarded by #if __CUDA_ARCH__ >= 900, so they work on all GPU architectures
  • routerGemm was previously restricted to SM100 only; fused_topk_deepseek was restricted to SM89/90/100/103/120/121
  • Shape/config validation is now called directly in the function body instead of via decorator
  • Router GEMM tests no longer skip on non-SM100 GPUs
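The decorator-to-direct-call change described above can be sketched as follows. The decorator and helper names come from the PR; the simplified bodies are illustrative stand-ins, not the actual flashinfer code:

```python
# Before: support was enforced at decoration time (illustrative sketch):
#
# @supported_compute_capability([100])
# @backend_requirement(...)
# def mm_M1_16_K7168_N128(...): ...

# After: the shape/config check is an explicit call in the function body.
def _mm_M1_16_K7168_N128_shape_checks(mat_a, mat_b):
    """Simplified stand-in for the PR's shape-check helper."""
    m, k = mat_a
    k2, n = mat_b
    if not (1 <= m <= 16 and k == k2 == 7168 and n == 128):
        raise ValueError(f"unsupported shapes: {mat_a} x {mat_b}")

def mm_M1_16_K7168_N128(mat_a, mat_b):
    # Validation now runs on every call, on any GPU architecture,
    # instead of gating the API behind a compute-capability list.
    _mm_M1_16_K7168_N128_shape_checks(mat_a, mat_b)
    return "launch kernel"  # placeholder for the real dispatch
```

Invalid shapes still raise `ValueError` exactly as before; only the architecture gate is gone.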

Test plan

  • Run pytest tests/model_optimizations/test_dsv3_router_gemm.py on non-SM100 GPU (e.g. A100, H100)
  • Run pytest tests/model_optimizations/test_dsv3_fused_routing.py on non-SM100 GPU
  • Verify negative tests still raise ValueError for invalid inputs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor

    • Restructured internal validation logic for DSv3 routing and GEMM kernels, moving validation checks to explicit runtime calls instead of decorator-based checks while maintaining the same functional behavior.
  • Tests

    • Updated test execution to run unconditionally on all CUDA devices, removing hardware-specific conditional skipping.

…_deepseek

Both kernels use only standard CUDA operations (warp shuffles, shared
memory, cooperative_groups) with SM90+ PDL features properly guarded
by #if __CUDA_ARCH__ >= 900. They work on all GPU architectures, not
just the ones previously listed.
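The guard pattern referenced above looks roughly like this (a minimal sketch; the actual kernels and PTX sequences in flashinfer differ):

```cuda
__global__ void routing_kernel(/* ... */) {
#if __CUDA_ARCH__ >= 900
  // SM90+ only: PDL lets this grid overlap with the prior grid.
  // Wait until the dependent prior grid's writes are visible.
  asm volatile("griddepcontrol.wait;");
#endif

  // ... standard CUDA work: warp shuffles, shared memory,
  //     cooperative_groups, etc., valid on all architectures ...

#if __CUDA_ARCH__ >= 900
  // Signal that dependent grids may begin their prologue early.
  asm volatile("griddepcontrol.launch_dependents;");
#endif
}
```

On pre-SM90 architectures the guarded blocks compile away entirely, so the same source builds and runs everywhere.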

- Remove @supported_compute_capability and @backend_requirement from
  routerGemm (was restricted to SM100 only) and fused_topk_deepseek
  (was restricted to SM89/90/100/103/120/121)
- Call shape/config validation directly in the function body instead
- Remove SM100-only pytest.skip in router GEMM tests
- Update docstrings to remove Blackwell-specific language

AI-assisted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the flexibility and accessibility of the routerGemm and fused_topk_deepseek kernels by eliminating previous GPU compute capability restrictions. By removing architecture-specific decorators and integrating validation directly into the function logic, the kernels can now execute on a wider array of GPU hardware, leveraging standard CUDA and SM90+ PDL features. This change ensures broader compatibility without compromising the integrity of input validation.

Highlights

  • Removed Compute Capability Restrictions: The @supported_compute_capability and @backend_requirement decorators have been removed from routerGemm and fused_topk_deepseek kernels, allowing them to run on a broader range of GPU architectures.
  • Internal Validation: Shape and configuration validation calls were moved from decorators directly into the function bodies of the affected kernels, ensuring checks are still performed.
  • Broader GPU Support: Both kernels use standard CUDA operations, with SM90+ Programmatic Dependent Launch (PDL) features guarded by #if __CUDA_ARCH__ >= 900, making them compatible with all modern GPU architectures.
  • Test Suite Update: Router GEMM tests no longer skip on non-SM100 GPUs, enabling more comprehensive testing across different hardware.


Changelog
  • flashinfer/fused_moe/fused_routing_dsv3.py
    • Removed supported_compute_capability and backend_requirement decorators.
    • Moved the _check_dsv3_fused_routing_supported call into the fused_topk_deepseek function body.
    • Updated docstrings to reflect broader SM90+ PDL support and removed architecture-specific optimization notes.
  • flashinfer/gemm/routergemm_dsv3.py
    • Removed supported_compute_capability and backend_requirement decorators for mm_M1_16_K7168_N128 and mm_M1_16_K7168_N256.
    • Moved shape check calls (_mm_M1_16_K7168_N128_shape_checks, _mm_M1_16_K7168_N256_shape_checks) into their respective function bodies.
    • Updated docstrings to mention SM90+ PDL support and removed specific architecture specialization notes.
  • tests/model_optimizations/test_dsv3_router_gemm.py
    • Removed the import of get_compute_capability.
    • Eliminated pytest.skip conditions that previously restricted router GEMM tests to SM100 GPUs.
Activity
  • The pull request was created by yzh119. No further review comments or activities have been recorded yet.

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

This PR migrates runtime capability and shape validation from decorator-based checks to explicit function calls in DSv3 fused routing and GEMM operations. Validation logic is moved inside functions rather than enforced at declaration-time via decorators.

Changes

Cohort / File(s): Summary

  • Validation Migration in Routing and GEMM (flashinfer/fused_moe/fused_routing_dsv3.py, flashinfer/gemm/routergemm_dsv3.py): Removed @supported_compute_capability and @backend_requirement decorators; added explicit runtime validation calls within fused_topk_deepseek and the mm_M1_16_K7168_N* functions. Updated docstrings to clarify SM90+ programmatic dependent launch (PDL) support.
  • Test Suite Updates (tests/model_optimizations/test_dsv3_router_gemm.py): Removed compute capability checks and conditional skips, allowing tests to run unconditionally on any CUDA device.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

op: comm

Suggested reviewers

  • jiahanc
  • cyx-6
  • nvmbreughe
  • djmmoss

Poem

🐰 No more decorators crowding the way,
Validation now runs at call time each day!
From SM90+ heights, the kernels shall spring,
Explicit checks guard each computational thing.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: coverage is 44.44%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title directly and clearly summarizes the main change, removing compute capability restrictions from two specific APIs (routerGemm and fused_topk_deepseek).
  • Description check ✅ Passed: the PR description includes a comprehensive summary of changes, detailed rationale, and test plan items, though some template sections (Pre-commit Checks, passing tests confirmation) are not fully addressed.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively removes the compute capability restrictions from routerGemm and fused_topk_deepseek APIs, which is a great improvement for broader hardware support. The changes are well-aligned with the description, replacing decorator-based checks with direct function calls and updating tests to run on more GPU architectures. The use of preprocessor guards in the CUDA code for architecture-specific features is correctly implemented.

I found one potential issue in the Python-side validation for fused_topk_deepseek where a check for the maximum number of experts in the multi-group case seems to be missing, which could lead to a runtime error. I've added a specific comment with a suggestion to address this.

Overall, this is a good change that improves usability and maintainability.

@claude

claude bot commented Feb 18, 2026

Code Review

Summary

This PR removes @supported_compute_capability and @backend_requirement decorators from routerGemm (mm_M1_16_K7168_N128, mm_M1_16_K7168_N256) and fused_topk_deepseek APIs, replacing decorator-based validation with direct function calls in the body.

The rationale is sound: both kernels use standard CUDA operations with the only SM90+ feature being the optional PDL griddepcontrol instructions, which are already guarded by #if __CUDA_ARCH__ >= 900 at the PTX level.


Correctness Analysis

PDL host-side launch attribute (cudaLaunchAttributeProgrammaticStreamSerialization)

Both kernels unconditionally set this launch attribute regardless of launch_with_pdl value. When use_pdl=false, programmaticStreamSerializationAllowed = 0, which CUDA silently ignores on older GPUs. The PTX griddepcontrol instructions are compile-time guarded by #if __CUDA_ARCH__ >= 900. No correctness issue here.

fused_topk_deepseek defaults launch_with_pdl=True

On pre-SM90 GPUs (e.g. A100), the default launch_with_pdl=True will set programmaticStreamSerializationAllowed = 1 in the CUDA launch attributes. While functionally a no-op on older hardware (the PTX griddepcontrol is compile-guarded), it is semantically surprising and inconsistent with mm_M1_16_K7168_N128/N256 which both default to False. Consider either:

  • Changing the default to False for consistency, or
  • Adding an arch-aware default (e.g., auto-detect SM90+)
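One way to express the arch-aware default suggested above (a hypothetical sketch: fused_topk_deepseek's real signature differs, and the capability value is passed in as a plain parameter here so the sketch stays self-contained; in real code it would come from something like torch.cuda.get_device_capability()):

```python
def _pdl_default(cc_major: int) -> bool:
    """Enable PDL only where the hardware supports it (SM90+)."""
    return cc_major >= 9

def fused_topk_deepseek(scores, launch_with_pdl=None, cc_major=9):
    # None means "decide from the device" rather than a fixed True,
    # keeping behavior consistent across architectures while still
    # honoring an explicit user override.
    if launch_with_pdl is None:
        launch_with_pdl = _pdl_default(cc_major)
    return launch_with_pdl  # placeholder for the real kernel launch
```

With this pattern an A100 caller silently gets PDL disabled, while H100/B200 callers keep the fast path, and explicit True/False still wins.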

Design Considerations

Loss of is_compute_capability_supported / is_backend_supported introspection

Removing @backend_requirement means these APIs no longer expose .is_compute_capability_supported(cc) and .is_backend_supported(backend, cc) methods. Any external code that uses these introspection methods will now fail with AttributeError. Worth verifying whether downstream users or the benchmark framework rely on these.

Resolved TODOs

The removed # TODO: other compute capabilities may be supported but are untested comments were the previous justification for the SM100-only restriction. It would be good to document in the PR which non-SM100 architectures (e.g., A100, H100) were actually tested and verified.


Test Coverage

Test plan items are unchecked; please update them after verification on the target hardware.

No test for launch_with_pdl=True on non-SM90: There is no test that validates the PDL flag behavior on a non-Hopper GPU. Given this is the main behavioral change enabled by this PR, adding such a test or documenting verified hardware would be valuable.

Accuracy threshold: The test uses cos_sim > 0.99, which should hold on all targets, but confirming on actual A100/H100 runs before merge is important.


Documentation

The docstring changes are accurate and consistent — replacing Blackwell-specific claims with "PDL is SM90+ only" notes. Good.


Minor

  • from flashinfer.utils import (register_custom_op,) — trailing comma inside the parenthesized import is a harmless style nit.

Verdict

The core change is technically correct: these kernels work on pre-SM100 GPUs — the only SM90+ feature is PDL, which is conditionally compiled at the PTX level. The change removes unnecessary restrictions.

The main concern before merge is the launch_with_pdl=True default in fused_topk_deepseek: it is surprising UX on non-SM90 GPUs and inconsistent with the other two APIs that default to False. Verifying that no external callers depend on the removed .is_compute_capability_supported() introspection API is also worth doing.

Generated with Claude Code

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/fused_moe/fused_routing_dsv3.py (1)

42-42: ⚠️ Potential issue | 🟠 Major

Bug activated by this PR: incorrect guard condition blocks valid n_group=1 configurations.

This PR introduces the first call to _check_dsv3_fused_routing_supported (lines 180-190), making the previously dormant check at line 42 active. The condition topk_group * n_group < topk is semantically wrong for the n_group == 1 case:

  • With n_group=1, topk_group is forced to 1 (by the topk_group > n_group guard).
  • Then topk_group * n_group = 1 * 1 = 1, so any topk > 1 raises ValueError.
  • This directly contradicts the n_group == 1 branch at lines 67-75, which explicitly allows topk up to 8.

The formula should be topk > topk_group * (num_experts // n_group) (experts available in the selected groups), not topk > topk_group * n_group.

🐛 Proposed fix
-    if topk_group * n_group < topk or topk_group > n_group:
+    if topk > topk_group * (scores.shape[1] // n_group) or topk_group > n_group:
         raise ValueError(
-            f"Invalid configuration: topk_group * n_group ({topk_group * n_group}) must be >= topk ({topk}) "
+            f"Invalid configuration: topk_group * experts_per_group ({topk_group * (scores.shape[1] // n_group)}) must be >= topk ({topk}) "
             f"and topk_group ({topk_group}) must be <= n_group ({n_group})"
         )

Also applies to: 180-190
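A small worked check of the two conditions for the n_group=1 case (the 256-expert count and other values are chosen for illustration):

```python
def old_guard_rejects(topk, topk_group, n_group):
    # The previously dormant check: caps topk at topk_group * n_group.
    return topk_group * n_group < topk or topk_group > n_group

def new_guard_rejects(topk, topk_group, n_group, num_experts):
    # Proposed fix: compare topk against the experts actually
    # available in the selected groups.
    return topk > topk_group * (num_experts // n_group) or topk_group > n_group

# DeepSeek-V3-like single-group config: 256 experts, n_group=1 forces
# topk_group=1, yet topk up to 8 should be allowed.
```

The old condition wrongly rejects topk=8 with a single group (1 * 1 < 8), while the fixed condition admits it because all 256 experts are in the selected group.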

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/fused_routing_dsv3.py` at line 42, The guard at the top
of fused_routing_dsv3.py incorrectly uses topk_group * n_group to cap topk and
blocks valid n_group==1 configs; update the condition (and the duplicate check
inside _check_dsv3_fused_routing_supported) to compute available experts per
selected groups: replace "if topk_group * n_group < topk or topk_group >
n_group:" with a check that computes available = topk_group * (num_experts //
n_group) and then raises only if topk > available (and keep the existing
topk_group > n_group check if still desired), i.e., use topk > topk_group *
(num_experts // n_group) so topk is compared against the actual experts
available in the chosen groups.

@yzh119
Collaborator Author

yzh119 commented Feb 18, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !324 has been created, and the CI pipeline #44314251 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #44314251: canceled

Collaborator

@aleozlx aleozlx left a comment


lgtm

Contributor

@jimmyzho jimmyzho left a comment


lgtm! P.S. Continuing from our previous discussions on Support Checks: should we add a decorator that wraps a support-check function for a specific API? Currently that is bundled with @backend_requirement; decoupling it would make sense since many non-GEMM interfaces do not have separate backends.
