Remove compute capability restrictions from routerGemm and fused_topk_deepseek #2576
Conversation
Both kernels use only standard CUDA operations (warp shuffles, shared memory, cooperative groups), with SM90+ PDL features properly guarded by `#if __CUDA_ARCH__ >= 900`. They work on all GPU architectures, not just the ones previously listed.

- Remove `@supported_compute_capability` and `@backend_requirement` from `routerGemm` (was restricted to SM100 only) and `fused_topk_deepseek` (was restricted to SM89/90/100/103/120/121)
- Call shape/config validation directly in the function body instead
- Remove the SM100-only `pytest.skip` in the router GEMM tests
- Update docstrings to remove Blackwell-specific language

AI-assisted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
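The decorator-to-explicit-call move described above can be sketched in plain Python. This is a hedged illustration only: the helper and function names below (`_check_router_gemm_shapes`, `router_gemm`) are hypothetical stand-ins, not flashinfer's actual internals.

```python
# Sketch: validation called inside the function body instead of being
# attached via @supported_compute_capability / @backend_requirement.
# All names here are illustrative, not flashinfer's real API.

def _check_router_gemm_shapes(m: int, k: int, n: int) -> None:
    """Shape checks for the router GEMM variants (M in [1, 16],
    K = 7168, N in {128, 256}), raised at call time."""
    if not 1 <= m <= 16:
        raise ValueError(f"M must be in [1, 16], got {m}")
    if k != 7168 or n not in (128, 256):
        raise ValueError(f"unsupported (K, N) = ({k}, {n})")

def router_gemm(m: int, k: int, n: int) -> str:
    _check_router_gemm_shapes(m, k, n)  # explicit call, no decorator
    return f"dispatching mm_M1_16_K{k}_N{n}"
```

Invalid shapes still fail fast with a `ValueError`, but the check no longer restricts which GPUs may call the function at declaration time.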
Activity
📝 Walkthrough

This PR migrates runtime capability and shape validation from decorator-based checks to explicit function calls in DSv3 fused routing and GEMM operations. Validation logic is moved inside the functions rather than enforced at declaration time via decorators.
Code Review
This pull request effectively removes the compute capability restrictions from routerGemm and fused_topk_deepseek APIs, which is a great improvement for broader hardware support. The changes are well-aligned with the description, replacing decorator-based checks with direct function calls and updating tests to run on more GPU architectures. The use of preprocessor guards in the CUDA code for architecture-specific features is correctly implemented.
I found one potential issue in the Python-side validation for fused_topk_deepseek where a check for the maximum number of experts in the multi-group case seems to be missing, which could lead to a runtime error. I've added a specific comment with a suggestion to address this.
Overall, this is a good change that improves usability and maintainability.
Code Review

**Summary**

This PR removes the `@supported_compute_capability` and `@backend_requirement` decorators from `routerGemm` and `fused_topk_deepseek`. The rationale is sound: both kernels use standard CUDA operations, with the only SM90+ feature being the optional PDL path, which is compile-time guarded.

**Correctness Analysis**

PDL host-side launch attribute: both kernels unconditionally set the PDL launch attribute, regardless of the device's compute capability. On pre-SM90 GPUs (e.g. A100), the effect of setting this attribute should be confirmed on hardware; if the driver rejects it, the launch would fail at runtime.

**Design Considerations**

Loss of declarative metadata: removing the decorators means supported-capability information is no longer visible at the API declaration, and invalid configurations now surface only at call time. The removed restrictions also resolve TODOs that were tracked against these APIs.

**Test Coverage**

Test plan items are unchecked; please update after verification on the target hardware. There is no test exercising the newly enabled architectures. The test relies on a fixed accuracy threshold.

**Documentation**

The docstring changes are accurate and consistent, replacing Blackwell-specific claims with "PDL is SM90+ only" notes. Good.
**Verdict**

The core change is technically correct: these kernels work on pre-SM100 GPUs, since the only SM90+ feature is PDL, which is conditionally compiled at the PTX level. The change removes unnecessary restrictions. The main concerns before merge are the unconditional PDL launch attribute on pre-SM90 GPUs and the unchecked test plan.

Generated with Claude Code
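One way to address the host-side concern raised above is to gate the attribute on the device's compute capability before launch. The sketch below is purely illustrative; `resolve_enable_pdl` is a hypothetical helper name, not an existing flashinfer function, and it assumes the caller can supply the device's major compute capability.

```python
def resolve_enable_pdl(requested: bool, cc_major: int) -> bool:
    """Only request the PDL launch attribute on SM90+ devices.

    On older architectures (cc_major < 9), fall back to a plain
    launch rather than setting an attribute the device may not honor.
    """
    return requested and cc_major >= 9
```

The kernel launch path would then consult this flag instead of setting the attribute unconditionally, which keeps pre-SM90 devices on the ordinary launch path.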
⚠️ Outside diff range comments (1)

`flashinfer/fused_moe/fused_routing_dsv3.py`

Line 42: ⚠️ Potential issue | 🟠 Major

Bug activated by this PR: incorrect guard condition blocks valid `n_group=1` configurations.

This PR introduces the first call to `_check_dsv3_fused_routing_supported` (lines 180-190), making the previously dormant check at line 42 active. The condition `topk_group * n_group < topk` is semantically wrong for the `n_group == 1` case:

- With `n_group=1`, `topk_group` is forced to `1` (by the `topk_group > n_group` guard).
- Then `topk_group * n_group = 1 * 1 = 1`, so any `topk > 1` raises `ValueError`.
- This directly contradicts the `n_group == 1` branch at lines 67-75, which explicitly allows `topk` up to 8.

The formula should be `topk > topk_group * (num_experts // n_group)` (experts available in the selected groups), not `topk > topk_group * n_group`.

🐛 Proposed fix

```diff
-    if topk_group * n_group < topk or topk_group > n_group:
+    if topk > topk_group * (scores.shape[1] // n_group) or topk_group > n_group:
         raise ValueError(
-            f"Invalid configuration: topk_group * n_group ({topk_group * n_group}) must be >= topk ({topk}) "
+            f"Invalid configuration: topk_group * experts_per_group ({topk_group * (scores.shape[1] // n_group)}) must be >= topk ({topk}) "
             f"and topk_group ({topk_group}) must be <= n_group ({n_group})"
         )
```

Also applies to: 180-190
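To see why the original condition over-rejects, here is a small self-contained check. The constants mirror the reviewer's `n_group=1` example; `num_experts=256` is an assumed illustrative value, not taken from the PR.

```python
def old_guard_ok(topk, topk_group, n_group):
    # Original condition: rejects whenever topk_group * n_group < topk.
    return not (topk_group * n_group < topk or topk_group > n_group)

def new_guard_ok(topk, topk_group, n_group, num_experts):
    # Proposed condition: compare topk against the experts actually
    # available in the selected groups.
    experts_per_group = num_experts // n_group
    return not (topk > topk_group * experts_per_group or topk_group > n_group)

# n_group=1 forces topk_group=1, yet the n_group == 1 branch allows topk up to 8.
print(old_guard_ok(8, 1, 1))        # valid config wrongly rejected
print(new_guard_ok(8, 1, 1, 256))   # accepted: 256 experts are available
```

With the corrected formula, `topk=8` is compared against `1 * (256 // 1) = 256` available experts instead of `1 * 1 = 1`, so the valid configuration passes.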
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/fused_routing_dsv3.py` at line 42, The guard at the top of fused_routing_dsv3.py incorrectly uses topk_group * n_group to cap topk and blocks valid n_group==1 configs; update the condition (and the duplicate check inside _check_dsv3_fused_routing_supported) to compute available experts per selected groups: replace "if topk_group * n_group < topk or topk_group > n_group:" with a check that computes available = topk_group * (num_experts // n_group) and then raises only if topk > available (and keep the existing topk_group > n_group check if still desired), i.e., use topk > topk_group * (num_experts // n_group) so topk is compared against the actual experts available in the chosen groups.
/bot run
[CANCELING] Pipeline #44314251: canceled
lgtm! ps. Continuing from our previous discussions on Support Checks: should we add a decorator that wraps a support-check function around a specific API? Currently we have it bundled with `@backend_requirement`; I think decoupling that would make sense, since many non-GEMM interfaces do not have separate backends.
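The decoupling suggested in the comment above could look roughly like this standalone decorator. This is a sketch only: `support_check` is a hypothetical name, not an existing flashinfer utility, and the check function shown is a toy example.

```python
import functools

def support_check(check_fn):
    """Attach a support-check function to an API without tying it to
    backend selection (in contrast to bundling it into a
    backend-requirement decorator)."""
    def wrap(api_fn):
        @functools.wraps(api_fn)
        def inner(*args, **kwargs):
            check_fn(*args, **kwargs)   # raises on unsupported input
            return api_fn(*args, **kwargs)
        return inner
    return wrap

def _check_positive(x):
    # Toy support check standing in for shape/capability validation.
    if x <= 0:
        raise ValueError("x must be positive")

@support_check(_check_positive)
def api(x):
    return x * 2
```

APIs without multiple backends could then reuse the same check mechanism without carrying unused backend-selection machinery.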
Summary

- Removed the `@supported_compute_capability` and `@backend_requirement` decorators from `routerGemm` (`mm_M1_16_K7168_N128`, `mm_M1_16_K7168_N256`) and the `fused_topk_deepseek` APIs
- Both kernels guard SM90+ PDL features with `#if __CUDA_ARCH__ >= 900`, so they work on all GPU architectures
- `routerGemm` was previously restricted to SM100 only; `fused_topk_deepseek` was restricted to SM89/90/100/103/120/121

Test plan

- [ ] `pytest tests/model_optimizations/test_dsv3_router_gemm.py` on a non-SM100 GPU (e.g. A100, H100)
- [ ] `pytest tests/model_optimizations/test_dsv3_fused_routing.py` on a non-SM100 GPU

🤖 Generated with Claude Code