fix(sm12x): fix micro-kernel workspace sizing when routed_rows > num_local_experts#3191

Merged
kahyunnam merged 6 commits into flashinfer-ai:main from meena-at-work:fix/b12x-moe-micro-kernel-workspace-sizing on May 7, 2026

Conversation

meena-at-work (Contributor) commented Apr 27, 2026

Summary

Two bugs in the b12x_fused_moe micro-kernel path (SM120/SM121; taken when routed_rows <= micro_cutover, where micro_cutover is typically 20–40):

  • allocate_sm120_static_workspace (moe_dispatch.py): compact_topk_ids was sized state_E (num local experts), but the micro-kernel fills it with flat_ids of length routed_rows = num_tokens * num_topk. When num_tokens * num_topk > num_local_experts (e.g. 2 tokens × 8 topk = 16 pairs, 8 local experts), this caused an assertion failure: compact_topk_ids buffer too small: 8 < 16. Fix: size as max(state_E, max_rows).

  • compact_topk_ids (triton_compact.py): validation required weight_expert_ids.numel() >= total_pairs. This was wrong — the Triton kernel writes to weight_expert_ids only at indices 0..active_expert_count-1, which is bounded by state_E (the number of local experts), not total_pairs. The check rejected valid calls where total_pairs > state_E. Fix: remove the check (with an explanatory comment).

Both bugs surface together whenever the batch is small enough to hit the micro-kernel path but num_tokens * num_topk > num_local_experts.
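The sizing arithmetic behind the first fix can be illustrated with a minimal pure-Python sketch. Note that compact_topk_ids_capacity is a hypothetical helper name used only for illustration; the real allocation happens inside allocate_sm120_static_workspace.

```python
# Hypothetical sketch of the fixed sizing rule — not the real allocator API.
# state_E = number of local experts; max_rows = num_tokens * num_topk.

def compact_topk_ids_capacity(state_E: int, max_rows: int) -> int:
    # The old (buggy) sizing used state_E alone, which underflows whenever
    # the micro-kernel's flat_ids (length max_rows) exceed the expert count.
    return max(state_E, max_rows)

# Failing configuration from the report: 2 tokens x 8 topk = 16 pairs,
# but only 8 local experts -> the old buffer of 8 was too small for 16 ids.
num_tokens, num_topk, state_E = 2, 8, 8
max_rows = num_tokens * num_topk
assert state_E < max_rows  # exactly the condition under which the old sizing broke
assert compact_topk_ids_capacity(state_E, max_rows) == 16
```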

Test plan

  • Unit tests in tests/kernels/moe/test_flashinfer_b12x_moe.py cover the small-batch micro-kernel path — 24/24 pass with this fix on DGX Spark (SM121)

Summary by CodeRabbit

  • Bug Fixes

    • Increased pre-pass compaction workspace to prevent micro-kernel buffer overruns and related runtime assertions.
  • Documentation

    • Clarified compaction behavior: the compacting kernel writes only up to the active expert count, so the prior strict size validation for the weight-expert ID buffer was removed.
  • Tests

    • Added regression tests for cases where token count × top-k exceeds local expert count.
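To see why the removed size check in the second fix was too strict, here is a small pure-Python model of the compaction contract (compact is an illustrative stand-in, not the Triton kernel's signature): the kernel writes weight_expert_ids only at indices 0..active_expert_count-1, so the buffer never needs total_pairs entries.

```python
# Illustrative model only — the real kernel is a Triton kernel in
# triton_compact.py with a different signature.

def compact(topk_ids, num_local_experts):
    """Dedupe a flat list of routed expert ids in first-seen order,
    returning (active_expert_count, weight_expert_ids)."""
    weight_expert_ids = []
    seen = set()
    for e in topk_ids:                   # total_pairs iterations...
        if e not in seen:
            seen.add(e)
            weight_expert_ids.append(e)  # ...but at most num_local_experts writes
    assert len(weight_expert_ids) <= num_local_experts
    return len(weight_expert_ids), weight_expert_ids

# 16 routed pairs over 8 local experts: at most 8 slots are ever written, so
# requiring weight_expert_ids.numel() >= total_pairs (16) rejected valid calls.
active, ids = compact([0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7], 8)
assert active == 8
```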

…local_experts

Two bugs in the SM12x b12x MoE micro-kernel path (triggered when
routed_rows <= micro_cutover, ~20-40):

1. allocate_sm120_static_workspace: compact_topk_ids was sized state_E,
   but the micro-kernel path passes flat_ids of length routed_rows (=
   num_tokens * num_topk), which can exceed num_local_experts (state_E)
   for small batch sizes. Fix: size as max(state_E, max_rows).

2. compact_topk_ids (triton_compact.py): validation check required
   weight_expert_ids.numel() >= total_pairs. This was wrong — the kernel
   writes to weight_expert_ids at indices 0..active_expert_count-1, which
   is bounded by state_E (num local experts), not total_pairs. The check
   incorrectly rejected valid calls where total_pairs > state_E. Fix:
   remove the check.

Together these caused an assertion failure whenever num_tokens * num_topk
> num_local_experts at micro-kernel batch sizes (e.g. 2 tokens * 8 topk =
16 pairs but only 8 local experts).

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
meena-at-work (Contributor, Author):

cc: @bkryu -- can you please review this PR?

coderabbitai (bot) commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The pre-pass compaction backing storage for compact_topk_ids is allocated using max(state_E, max_rows) instead of state_E, and a runtime size check for weight_expert_ids was removed with a clarifying comment about Triton writing only to indices 0..active_expert_count-1. A new micro-kernel regression test covering cases where num_tokens * top_k > num_local_experts was added.

Changes

MoE micro-kernel / compaction flow

  • Allocation / data shape — flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py: the pre-pass compaction buffer compact_topk_ids is allocated with max(state_E, max_rows) instead of state_E (increases host-visible backing storage).
  • Core kernel contract / validation — flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py: removed the runtime ValueError check requiring weight_expert_ids length ≥ total_pairs; added a comment that Triton writes weight_expert_ids only for indices 0..active_expert_count-1 (bounded by the number of local experts).
  • Tests / regression — tests/moe/test_b12x_fused_moe.py: added the parameterized test test_micro_pairs_exceed_local_experts to exercise the micro-kernel path where num_tokens * top_k > num_local_experts; asserts shapes, finiteness, and numeric agreement with the reference within tolerance.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • samuellees
  • IwakuraRein
  • jiahanc
  • nv-yunzheq
  • bkryu
  • aleozlx

Poem

🐰 I grew a buffer, roomy and neat,
So tiny top-k feet find space to meet.
Triton scrawls only where it's told,
Tests hop by, no assert to scold. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check — ✅ Passed: the title clearly and specifically describes the primary fix, micro-kernel workspace sizing when routed_rows exceeds num_local_experts, which is the core issue addressed in both code changes.
  • Description check — ✅ Passed: the description explains both bugs, their root causes, fixes, and test coverage, though the PR template checklist items remain unchecked and some pre-commit verification details are incomplete.
  • Docstring coverage — ✅ Passed: docstring coverage is 100.00%, above the required 80.00% threshold.
  • Linked Issues check — ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: skipped because no linked issues were found for this pull request.



gemini-code-assist (bot) left a comment


Code Review

This pull request adjusts the allocation size for compact_topk_ids in the MoE dispatch logic and removes an incorrect validation check in the Triton compacting function. Feedback was provided to improve the clarity of a code comment by avoiding the use of a variable name that is not defined within the local scope.

Comment thread flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py Outdated
meena-at-work and others added 2 commits April 27, 2026 21:46
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace state_E (not in scope here) with 'the number of local experts'.

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py (1)

142-142: Refresh the stale field-level comment.

The dataclass annotation still says # [state_E] int32, for micro kernel pre-pass, but the buffer is now sized to max(state_E, max_rows). Worth updating to avoid confusing future readers about the actual capacity contract.

📝 Suggested update
-    compact_topk_ids: torch.Tensor  # [state_E] int32, for micro kernel pre-pass
+    compact_topk_ids: torch.Tensor  # [max(state_E, max_rows)] int32, for micro kernel pre-pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py` at line 142,
Update the stale field comment for the dataclass field compact_topk_ids to
reflect its current capacity contract: note that the buffer is sized to
max(state_E, max_rows) (still int32) and clarify it's used for the micro-kernel
pre-pass / compact top-k indices; locate the compact_topk_ids declaration in
moe_dispatch.py and replace the old “[state_E] int32, for micro kernel pre-pass”
comment with a concise comment such as “# [max(state_E, max_rows)] int32, for
micro-kernel pre-pass (compact top-k indices)”.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 060e40a7-e392-4ab9-a9ef-eb8b9cf351c6

📥 Commits

Reviewing files that changed from the base of the PR and between d0d7a10 and 27d763e.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py

coderabbitai (bot) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py`:
- Around line 71-72: The docstring describing the expected size for
weight_expert_ids is stale: it currently claims weight_expert_ids must be length
>= total_pairs but the code now only writes indices 0..active_expert_count-1
bounded by the number of local experts. Update the docstring near the
weight_expert_ids parameter (the docstring around line ~62 in triton_compact.py)
to state that weight_expert_ids needs to be sized to accommodate
active_expert_count (or the number of local experts) rather than total_pairs,
and clarify that no full total_pairs-sized buffer is required because writes are
limited to 0..active_expert_count-1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 671bfcab-f71e-4f41-aab4-30ec6f6e645b

📥 Commits

Reviewing files that changed from the base of the PR and between 27d763e and fb0870d.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py

Comment on lines +71 to +72
# weight_expert_ids writes at indices 0..active_expert_count-1 (bounded by
# the number of local experts, not total_pairs), so no size check is needed here.


⚠️ Potential issue | 🟡 Minor

Docstring contract is now stale after removing the size check.

Line 62 still says weight_expert_ids must be [>=total_pairs], but Lines 71-72 explicitly relax that. Please update the docstring to reflect the new expected sizing contract.

Suggested doc fix
-        weight_expert_ids: [>=total_pairs] int32 — output: local->global map.
+        weight_expert_ids: int32 — output: local->global map; size must cover
+            the maximum number of unique experts expected in `topk_ids`.

…um_local_experts

Adds test_micro_pairs_exceed_local_experts with three configurations where
num_tokens * top_k > num_local_experts, directly exercising the bug fixed
by sizing compact_topk_ids as max(state_E, max_rows).

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai (bot) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_b12x_fused_moe.py`:
- Around line 1070-1140: The new regression test function
test_micro_pairs_exceed_local_experts in the tests/moe/test_b12x_fused_moe.py
file was auto-reformatted by ruff-format in CI; commit the formatter's output so
the repo matches CI (prevent pre-commit/CI failures). Run ruff-format (or your
project's formatting command) on the changed hunk containing
test_micro_pairs_exceed_local_experts and add the resulting changes to the same
patch/PR, then re-run tests to ensure the committed formatting is the only
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a81dfaeb-9bb9-4079-a832-e8c5691c6723

📥 Commits

Reviewing files that changed from the base of the PR and between fb0870d and 224c441.

📒 Files selected for processing (1)
  • tests/moe/test_b12x_fused_moe.py

Comment thread tests/moe/test_b12x_fused_moe.py
kahyunnam (Member):

/bot run

flashinfer-bot (Collaborator):

GitLab MR !626 has been created, and the CI pipeline #50264407 is currently running. I'll report back once the pipeline job completes.

aleozlx (Collaborator) commented May 4, 2026:

pls address pre-commit

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai (bot) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/moe/test_b12x_fused_moe.py (1)

1070-1077: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Outstanding pre-commit formatting failure — ruff-format still needs to be committed.

A previous CI run shows ruff-format auto-modified this exact hunk and the reformatted version was not committed. This will continue to fail the pre-commit gate on every subsequent CI run until the formatter output is checked in.

Run pre-commit run --all-files (or ruff format tests/moe/test_b12x_fused_moe.py) locally and commit the result.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_b12x_fused_moe.py` around lines 1070 - 1077, The failing
pre-commit check is due to uncommitted changes from ruff-format for the
parametrized test decorator; run the formatter and commit the updated hunk.
Locally run `pre-commit run --all-files` or `ruff format
tests/moe/test_b12x_fused_moe.py` to apply ruff formatting to the
pytest.mark.parametrize block (the decorator specifying
"num_tokens,top_k,num_experts"), verify the diff, and commit the formatted file
so the pre-commit gate passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/moe/test_b12x_fused_moe.py`:
- Around line 1073-1076: The test case tuple (4, 8, 16) in the parameter list
can produce routed_rows=32 which may not hit the micro-kernel path; replace that
tuple with (2, 8, 16) so routed_rows becomes 16 (< typical micro_cutover) to
reliably exercise the micro-kernel path in the test (update the tuple in the
list of test cases in test_b12x_fused_moe.py).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52077f96-6d43-45a9-b275-702ce81d6c77

📥 Commits

Reviewing files that changed from the base of the PR and between 224c441 and 4e76519.

📒 Files selected for processing (1)
  • tests/moe/test_b12x_fused_moe.py

Comment on lines +1073 to +1076
(2, 8, 8), # total_pairs=16 > num_local_experts=8
(4, 8, 16), # total_pairs=32 > num_local_experts=16
(4, 4, 8), # total_pairs=16 > num_local_experts=8
],


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test case (4, 8, 16) may not reliably exercise the micro-kernel path.

routed_rows = num_tokens * top_k = 4 * 8 = 32. The micro-kernel is selected only when routed_rows <= micro_cutover, which the PR describes as "typically 20–40". If micro_cutover is 20 for the target hardware/configuration, this case silently falls through to the standard path and does not exercise the regression being fixed.

Consider replacing (4, 8, 16) with a case whose routed_rows is safely within the guaranteed micro cutover range — e.g. (2, 8, 16) gives routed_rows=16 < 20.

🔧 Suggested replacement
-            (4, 8, 16),  # total_pairs=32 > num_local_experts=16
+            (2, 8, 16),  # total_pairs=16 > num_local_experts=16, routed_rows safely within micro cutover
📝 Committable suggestion (‼️ IMPORTANT: carefully review before committing; ensure it accurately replaces the highlighted code, has correct indentation, and passes tests)

             (2, 8, 8),  # total_pairs=16 > num_local_experts=8
-            (4, 8, 16),  # total_pairs=32 > num_local_experts=16
+            (2, 8, 16),  # total_pairs=16 > num_local_experts=16, routed_rows safely within micro cutover
             (4, 4, 8),  # total_pairs=16 > num_local_experts=8
         ],

@kahyunnam kahyunnam enabled auto-merge (squash) May 6, 2026 22:02
@kahyunnam kahyunnam added the v0.6.11 release blocker label for 0.6.11 label May 6, 2026
@aleozlx aleozlx added the run-ci label May 6, 2026
@kahyunnam kahyunnam merged commit 14f2bee into flashinfer-ai:main May 7, 2026
83 of 102 checks passed

Labels

op: moe run-ci v0.6.11 release blocker label for 0.6.11


4 participants