
Remove compute capability restrictions from routerGemm and fused_topk_deepseek#2576

Open
yzh119 wants to merge 1 commit into main from remove-cc-restriction-router-gemm-topk

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Feb 18, 2026

Summary

  • Remove @supported_compute_capability and @backend_requirement decorators from routerGemm (mm_M1_16_K7168_N128, mm_M1_16_K7168_N256) and fused_topk_deepseek APIs
  • Both kernels use standard CUDA operations with SM90+ PDL features guarded by #if __CUDA_ARCH__ >= 900, so they work on all GPU architectures
  • routerGemm was previously restricted to SM100 only; fused_topk_deepseek was restricted to SM89/90/100/103/120/121
  • Shape/config validation is now called directly in the function body instead of via decorator
  • Router GEMM tests no longer skip on non-SM100 GPUs
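The decorator-to-direct-call change described above can be sketched as follows. The decorator and helper names come from the PR; the simplified bodies are illustrative stand-ins, not the actual flashinfer code:

```python
# Before: support was enforced at decoration time (illustrative sketch):
#
# @supported_compute_capability([100])
# @backend_requirement(...)
# def mm_M1_16_K7168_N128(...): ...

# After: the shape/config check is an explicit call in the function body.
def _mm_M1_16_K7168_N128_shape_checks(mat_a, mat_b):
    """Simplified stand-in for the PR's shape-check helper."""
    m, k = mat_a
    k2, n = mat_b
    if not (1 <= m <= 16 and k == k2 == 7168 and n == 128):
        raise ValueError(f"unsupported shapes: {mat_a} x {mat_b}")

def mm_M1_16_K7168_N128(mat_a, mat_b):
    # Validation now runs on every call, on any GPU architecture,
    # instead of gating the API behind a compute-capability list.
    _mm_M1_16_K7168_N128_shape_checks(mat_a, mat_b)
    return "launch kernel"  # placeholder for the real dispatch
```

Invalid shapes still raise `ValueError` exactly as before; only the architecture gate is gone.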

Test plan

  • Run pytest tests/model_optimizations/test_dsv3_router_gemm.py on non-SM100 GPU (e.g. A100, H100)
  • Run pytest tests/model_optimizations/test_dsv3_fused_routing.py on non-SM100 GPU
  • Verify negative tests still raise ValueError for invalid inputs

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Refactor

    • Restructured internal validation logic for DSv3 routing and GEMM kernels, moving validation checks to explicit runtime calls instead of decorator-based checks while maintaining the same functional behavior.
  • Tests

    • Updated test execution to run unconditionally on all CUDA devices, removing hardware-specific conditional skipping.

…_deepseek

Both kernels use only standard CUDA operations (warp shuffles, shared
memory, cooperative_groups) with SM90+ PDL features properly guarded
by #if __CUDA_ARCH__ >= 900. They work on all GPU architectures, not
just the ones previously listed.
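The guard pattern referenced above looks roughly like this (a minimal sketch; the actual kernels and PTX sequences in flashinfer differ):

```cuda
__global__ void routing_kernel(/* ... */) {
#if __CUDA_ARCH__ >= 900
  // SM90+ only: PDL lets this grid overlap with the prior grid.
  // Wait until the dependent prior grid's writes are visible.
  asm volatile("griddepcontrol.wait;");
#endif

  // ... standard CUDA work: warp shuffles, shared memory,
  //     cooperative_groups, etc., valid on all architectures ...

#if __CUDA_ARCH__ >= 900
  // Signal that dependent grids may begin their prologue early.
  asm volatile("griddepcontrol.launch_dependents;");
#endif
}
```

On pre-SM90 architectures the guarded blocks compile away entirely, so the same source builds and runs everywhere.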

- Remove @supported_compute_capability and @backend_requirement from
  routerGemm (was restricted to SM100 only) and fused_topk_deepseek
  (was restricted to SM89/90/100/103/120/121)
- Call shape/config validation directly in the function body instead
- Remove SM100-only pytest.skip in router GEMM tests
- Update docstrings to remove Blackwell-specific language

AI-assisted

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the flexibility and accessibility of the routerGemm and fused_topk_deepseek kernels by eliminating previous GPU compute capability restrictions. By removing architecture-specific decorators and integrating validation directly into the function logic, the kernels can now execute on a wider array of GPU hardware, leveraging standard CUDA and SM90+ PDL features. This change ensures broader compatibility without compromising the integrity of input validation.

Highlights

  • Removed Compute Capability Restrictions: The @supported_compute_capability and @backend_requirement decorators have been removed from routerGemm and fused_topk_deepseek kernels, allowing them to run on a broader range of GPU architectures.
  • Internal Validation: Shape and configuration validation calls were moved from decorators directly into the function bodies of the affected kernels, ensuring checks are still performed.
  • Broader GPU Support: Both kernels use standard CUDA operations, with SM90+ Programmatic Dependent Launch (PDL) features guarded by #if __CUDA_ARCH__ >= 900, making them compatible with all modern GPU architectures.
  • Test Suite Update: Router GEMM tests no longer skip on non-SM100 GPUs, enabling more comprehensive testing across different hardware.


Changelog
  • flashinfer/fused_moe/fused_routing_dsv3.py
    • Removed supported_compute_capability and backend_requirement decorators.
    • Moved the _check_dsv3_fused_routing_supported call into the fused_topk_deepseek function body.
    • Updated docstrings to reflect broader SM90+ PDL support and removed architecture-specific optimization notes.
  • flashinfer/gemm/routergemm_dsv3.py
    • Removed supported_compute_capability and backend_requirement decorators for mm_M1_16_K7168_N128 and mm_M1_16_K7168_N256.
    • Moved shape check calls (_mm_M1_16_K7168_N128_shape_checks, _mm_M1_16_K7168_N256_shape_checks) into their respective function bodies.
    • Updated docstrings to mention SM90+ PDL support and removed specific architecture specialization notes.
  • tests/model_optimizations/test_dsv3_router_gemm.py
    • Removed the import of get_compute_capability.
    • Eliminated pytest.skip conditions that previously restricted router GEMM tests to SM100 GPUs.
Activity
  • The pull request was created by yzh119. No further review comments or activities have been recorded yet.

@coderabbitai
Contributor

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Walkthrough

This PR migrates runtime capability and shape validation from decorator-based checks to explicit function calls in DSv3 fused routing and GEMM operations. Validation logic is moved inside functions rather than enforced at declaration-time via decorators.

Changes

Cohort / File(s): Summary

  • Validation Migration in Routing and GEMM (flashinfer/fused_moe/fused_routing_dsv3.py, flashinfer/gemm/routergemm_dsv3.py): Removed @supported_compute_capability and @backend_requirement decorators; added explicit runtime validation calls within fused_topk_deepseek and the mm_M1_16_K7168_N* functions. Updated docstrings to clarify SM90+ programmatic dependent launch (PDL) support.
  • Test Suite Updates (tests/model_optimizations/test_dsv3_router_gemm.py): Removed compute capability checks and conditional skips, allowing tests to run unconditionally on any CUDA device.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

op: comm

Suggested reviewers

  • jiahanc
  • cyx-6
  • nvmbreughe
  • djmmoss

Poem

🐰 No more decorators crowding the way,
Validation now runs at call time each day!
From SM90+ heights, the kernels shall spring,
Explicit checks guard each computational thing.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: coverage is 44.44%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title directly and clearly summarizes the main change, removing compute capability restrictions from two specific APIs (routerGemm and fused_topk_deepseek).
  • Description check ✅ Passed: the PR description includes a comprehensive summary of changes, detailed rationale, and test plan items, though some template sections (Pre-commit Checks, passing tests confirmation) are not fully addressed.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively removes the compute capability restrictions from routerGemm and fused_topk_deepseek APIs, which is a great improvement for broader hardware support. The changes are well-aligned with the description, replacing decorator-based checks with direct function calls and updating tests to run on more GPU architectures. The use of preprocessor guards in the CUDA code for architecture-specific features is correctly implemented.

I found one potential issue in the Python-side validation for fused_topk_deepseek where a check for the maximum number of experts in the multi-group case seems to be missing, which could lead to a runtime error. I've added a specific comment with a suggestion to address this.

Overall, this is a good change that improves usability and maintainability.

@claude

claude bot commented Feb 18, 2026

Code Review

Summary

This PR removes @supported_compute_capability and @backend_requirement decorators from routerGemm (mm_M1_16_K7168_N128, mm_M1_16_K7168_N256) and fused_topk_deepseek APIs, replacing decorator-based validation with direct function calls in the body.

The rationale is sound: both kernels use standard CUDA operations with the only SM90+ feature being the optional PDL griddepcontrol instructions, which are already guarded by #if __CUDA_ARCH__ >= 900 at the PTX level.


Correctness Analysis

PDL host-side launch attribute (cudaLaunchAttributeProgrammaticStreamSerialization)

Both kernels unconditionally set this launch attribute regardless of launch_with_pdl value. When use_pdl=false, programmaticStreamSerializationAllowed = 0, which CUDA silently ignores on older GPUs. The PTX griddepcontrol instructions are compile-time guarded by #if __CUDA_ARCH__ >= 900. No correctness issue here.

fused_topk_deepseek defaults launch_with_pdl=True

On pre-SM90 GPUs (e.g. A100), the default launch_with_pdl=True will set programmaticStreamSerializationAllowed = 1 in the CUDA launch attributes. While functionally a no-op on older hardware (the PTX griddepcontrol is compile-guarded), it is semantically surprising and inconsistent with mm_M1_16_K7168_N128/N256 which both default to False. Consider either:

  • Changing the default to False for consistency, or
  • Adding an arch-aware default (e.g., auto-detect SM90+)
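One way to express the arch-aware default suggested above (a hypothetical sketch: fused_topk_deepseek's real signature differs, and the capability value is passed in as a plain parameter here so the sketch stays self-contained; in real code it would come from something like torch.cuda.get_device_capability()):

```python
def _pdl_default(cc_major: int) -> bool:
    """Enable PDL only where the hardware supports it (SM90+)."""
    return cc_major >= 9

def fused_topk_deepseek(scores, launch_with_pdl=None, cc_major=9):
    # None means "decide from the device" rather than a fixed True,
    # keeping behavior consistent across architectures while still
    # honoring an explicit user override.
    if launch_with_pdl is None:
        launch_with_pdl = _pdl_default(cc_major)
    return launch_with_pdl  # placeholder for the real kernel launch
```

With this pattern an A100 caller silently gets PDL disabled, while H100/B200 callers keep the fast path, and explicit True/False still wins.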

Design Considerations

Loss of is_compute_capability_supported / is_backend_supported introspection

Removing @backend_requirement means these APIs no longer expose .is_compute_capability_supported(cc) and .is_backend_supported(backend, cc) methods. Any external code that uses these introspection methods will now fail with AttributeError. Worth verifying whether downstream users or the benchmark framework rely on these.

Resolved TODOs

The removed # TODO: other compute capabilities may be supported but are untested comments were the previous justification for the SM100-only restriction. It would be good to document in the PR which non-SM100 architectures (e.g., A100, H100) were actually tested and verified.


Test Coverage

Test plan items are unchecked; please update them after verification on the target hardware.

No test for launch_with_pdl=True on non-SM90: There is no test that validates the PDL flag behavior on a non-Hopper GPU. Given this is the main behavioral change enabled by this PR, adding such a test or documenting verified hardware would be valuable.

Accuracy threshold: The test uses cos_sim > 0.99, which should hold on all targets, but confirming on actual A100/H100 runs before merge is important.


Documentation

The docstring changes are accurate and consistent — replacing Blackwell-specific claims with "PDL is SM90+ only" notes. Good.


Minor

  • from flashinfer.utils import (register_custom_op,) — trailing comma inside the parenthesized import is a harmless style nit.

Verdict

The core change is technically correct: these kernels work on pre-SM100 GPUs — the only SM90+ feature is PDL, which is conditionally compiled at the PTX level. The change removes unnecessary restrictions.

The main concern before merge is the launch_with_pdl=True default in fused_topk_deepseek: it is surprising UX on non-SM90 GPUs and inconsistent with the other two APIs that default to False. Verifying that no external callers depend on the removed .is_compute_capability_supported() introspection API is also worth doing.

Generated with Claude Code

Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/fused_moe/fused_routing_dsv3.py (1)

42-42: ⚠️ Potential issue | 🟠 Major

Bug activated by this PR: incorrect guard condition blocks valid n_group=1 configurations.

This PR introduces the first call to _check_dsv3_fused_routing_supported (lines 180-190), making the previously dormant check at line 42 active. The condition topk_group * n_group < topk is semantically wrong for the n_group == 1 case:

  • With n_group=1, topk_group is forced to 1 (by the topk_group > n_group guard).
  • Then topk_group * n_group = 1 * 1 = 1, so any topk > 1 raises ValueError.
  • This directly contradicts the n_group == 1 branch at lines 67-75, which explicitly allows topk up to 8.

The formula should be topk > topk_group * (num_experts // n_group) (experts available in the selected groups), not topk > topk_group * n_group.

🐛 Proposed fix
-    if topk_group * n_group < topk or topk_group > n_group:
+    if topk > topk_group * (scores.shape[1] // n_group) or topk_group > n_group:
         raise ValueError(
-            f"Invalid configuration: topk_group * n_group ({topk_group * n_group}) must be >= topk ({topk}) "
+            f"Invalid configuration: topk_group * experts_per_group ({topk_group * (scores.shape[1] // n_group)}) must be >= topk ({topk}) "
             f"and topk_group ({topk_group}) must be <= n_group ({n_group})"
         )

Also applies to: 180-190
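A small worked check of the two conditions for the n_group=1 case (the 256-expert count and other values are chosen for illustration):

```python
def old_guard_rejects(topk, topk_group, n_group):
    # The previously dormant check: caps topk at topk_group * n_group.
    return topk_group * n_group < topk or topk_group > n_group

def new_guard_rejects(topk, topk_group, n_group, num_experts):
    # Proposed fix: compare topk against the experts actually
    # available in the selected groups.
    return topk > topk_group * (num_experts // n_group) or topk_group > n_group

# DeepSeek-V3-like single-group config: 256 experts, n_group=1 forces
# topk_group=1, yet topk up to 8 should be allowed.
```

The old condition wrongly rejects topk=8 with a single group (1 * 1 < 8), while the fixed condition admits it because all 256 experts are in the selected group.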

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/fused_routing_dsv3.py` at line 42, The guard at the top
of fused_routing_dsv3.py incorrectly uses topk_group * n_group to cap topk and
blocks valid n_group==1 configs; update the condition (and the duplicate check
inside _check_dsv3_fused_routing_supported) to compute available experts per
selected groups: replace "if topk_group * n_group < topk or topk_group >
n_group:" with a check that computes available = topk_group * (num_experts //
n_group) and then raises only if topk > available (and keep the existing
topk_group > n_group check if still desired), i.e., use topk > topk_group *
(num_experts // n_group) so topk is compared against the actual experts
available in the chosen groups.

@yzh119
Collaborator Author

yzh119 commented Feb 18, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !324 has been created, and the CI pipeline #44314251 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELING] Pipeline #44314251: canceled

Collaborator

@aleozlx aleozlx left a comment


lgtm

Contributor

@jimmyzho jimmyzho left a comment


lgtm! P.S. Continuing from our previous discussions on Support Checks: should we add a decorator that wraps a support-check function for a specific API? Currently that is bundled with @backend_requirement; decoupling it would make sense since many non-GEMM interfaces do not have separate backends.
