fix(sm12x): fix micro-kernel workspace sizing when routed_rows > num_local_experts#3191

Merged
kahyunnam merged 6 commits into flashinfer-ai:main from meena-at-work:fix/b12x-moe-micro-kernel-workspace-sizing on May 7, 2026

Conversation

meena-at-work (Contributor) commented Apr 27, 2026

Summary

Two bugs in the b12x_fused_moe micro-kernel path (SM120/SM121; taken when routed_rows <= micro_cutover, where micro_cutover is typically 20–40):

  • allocate_sm120_static_workspace (moe_dispatch.py): compact_topk_ids was sized state_E (num local experts), but the micro-kernel fills it with flat_ids of length routed_rows = num_tokens * num_topk. When num_tokens * num_topk > num_local_experts (e.g. 2 tokens × 8 topk = 16 pairs, 8 local experts), this caused an assertion failure: compact_topk_ids buffer too small: 8 < 16. Fix: size as max(state_E, max_rows).

  • compact_topk_ids (triton_compact.py): validation required weight_expert_ids.numel() >= total_pairs. This was wrong — the Triton kernel writes to weight_expert_ids only at indices 0..active_expert_count-1, which is bounded by state_E (the number of local experts), not total_pairs. The check rejected valid calls where total_pairs > state_E. Fix: remove the check (with an explanatory comment).

Both bugs surface together whenever the batch is small enough to hit the micro-kernel path but num_tokens * num_topk > num_local_experts.
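The sizing arithmetic behind the first fix can be illustrated with a minimal pure-Python sketch. Note that compact_topk_ids_capacity is a hypothetical helper name used only for illustration; the real allocation happens inside allocate_sm120_static_workspace.

```python
# Hypothetical sketch of the fixed sizing rule — not the real allocator API.
# state_E = number of local experts; max_rows = num_tokens * num_topk.

def compact_topk_ids_capacity(state_E: int, max_rows: int) -> int:
    # The old (buggy) sizing used state_E alone, which underflows whenever
    # the micro-kernel's flat_ids (length max_rows) exceed the expert count.
    return max(state_E, max_rows)

# Failing configuration from the report: 2 tokens x 8 topk = 16 pairs,
# but only 8 local experts -> the old buffer of 8 was too small for 16 ids.
num_tokens, num_topk, state_E = 2, 8, 8
max_rows = num_tokens * num_topk
assert state_E < max_rows  # exactly the condition under which the old sizing broke
assert compact_topk_ids_capacity(state_E, max_rows) == 16
```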

Test plan

  • Unit tests in tests/kernels/moe/test_flashinfer_b12x_moe.py cover the small-batch micro-kernel path — 24/24 pass with this fix on DGX Spark (SM121)

Summary by CodeRabbit

  • Bug Fixes

    • Increased pre-pass compaction workspace to prevent micro-kernel buffer overruns and related runtime assertions.
  • Documentation

    • Clarified compaction behavior: the compacting kernel writes only up to the active expert count, so the prior strict size validation for the weight-expert ID buffer was removed.
  • Tests

    • Added regression tests for cases where token count × top-k exceeds local expert count.
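To see why the removed size check in the second fix was too strict, here is a small pure-Python model of the compaction contract (compact is an illustrative stand-in, not the Triton kernel's signature): the kernel writes weight_expert_ids only at indices 0..active_expert_count-1, so the buffer never needs total_pairs entries.

```python
# Illustrative model only — the real kernel is a Triton kernel in
# triton_compact.py with a different signature.

def compact(topk_ids, num_local_experts):
    """Dedupe a flat list of routed expert ids in first-seen order,
    returning (active_expert_count, weight_expert_ids)."""
    weight_expert_ids = []
    seen = set()
    for e in topk_ids:                   # total_pairs iterations...
        if e not in seen:
            seen.add(e)
            weight_expert_ids.append(e)  # ...but at most num_local_experts writes
    assert len(weight_expert_ids) <= num_local_experts
    return len(weight_expert_ids), weight_expert_ids

# 16 routed pairs over 8 local experts: at most 8 slots are ever written, so
# requiring weight_expert_ids.numel() >= total_pairs (16) rejected valid calls.
active, ids = compact([0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7], 8)
assert active == 8
```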

…local_experts

Two bugs in the SM12x b12x MoE micro-kernel path (triggered when
routed_rows <= micro_cutover, ~20-40):

1. allocate_sm120_static_workspace: compact_topk_ids was sized state_E,
   but the micro-kernel path passes flat_ids of length routed_rows (=
   num_tokens * num_topk), which can exceed num_local_experts (state_E)
   for small batch sizes. Fix: size as max(state_E, max_rows).

2. compact_topk_ids (triton_compact.py): validation check required
   weight_expert_ids.numel() >= total_pairs. This was wrong — the kernel
   writes to weight_expert_ids at indices 0..active_expert_count-1, which
   is bounded by state_E (num local experts), not total_pairs. The check
   incorrectly rejected valid calls where total_pairs > state_E. Fix:
   remove the check.

Together these caused an assertion failure whenever num_tokens * num_topk
> num_local_experts at micro-kernel batch sizes (e.g. 2 tokens * 8 topk =
16 pairs but only 8 local experts).

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
meena-at-work (Contributor, Author):

cc: @bkryu -- can you please review this PR?

coderabbitai (bot) commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

The pre-pass compaction backing storage for compact_topk_ids is allocated using max(state_E, max_rows) instead of state_E, and a runtime size check for weight_expert_ids was removed with a clarifying comment about Triton writing only to indices 0..active_expert_count-1. A new micro-kernel regression test covering cases where num_tokens * top_k > num_local_experts was added.

Changes

MoE micro-kernel / compaction flow

  • Allocation / data shape — flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py: the pre-pass compaction buffer compact_topk_ids is allocated with max(state_E, max_rows) instead of state_E (increases host-visible backing storage).
  • Core kernel contract / validation — flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py: removed the runtime ValueError check requiring weight_expert_ids length ≥ total_pairs; added a comment that Triton writes weight_expert_ids only for indices 0..active_expert_count-1 (bounded by the number of local experts).
  • Tests / regression — tests/moe/test_b12x_fused_moe.py: added the parameterized test test_micro_pairs_exceed_local_experts to exercise the micro-kernel path where num_tokens * top_k > num_local_experts; asserts shapes, finiteness, and numeric agreement with the reference within tolerance.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • samuellees
  • IwakuraRein
  • jiahanc
  • nv-yunzheq
  • bkryu
  • aleozlx

Poem

🐰 I grew a buffer, roomy and neat,
So tiny top-k feet find space to meet.
Triton scrawls only where it's told,
Tests hop by, no assert to scold. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check — ✅ Passed: the title clearly and specifically describes the primary fix, micro-kernel workspace sizing when routed_rows exceeds num_local_experts, which is the core issue addressed in both code changes.
  • Description check — ✅ Passed: the description explains both bugs, their root causes, fixes, and test coverage, though the PR template checklist items remain unchecked and some pre-commit verification details are incomplete.
  • Docstring coverage — ✅ Passed: docstring coverage is 100.00%, above the required 80.00% threshold.
  • Linked Issues check — ✅ Passed: skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: skipped because no linked issues were found for this pull request.



gemini-code-assist (bot) left a comment


Code Review

This pull request adjusts the allocation size for compact_topk_ids in the MoE dispatch logic and removes an incorrect validation check in the Triton compacting function. Feedback was provided to improve the clarity of a code comment by avoiding the use of a variable name that is not defined within the local scope.

Comment thread flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py Outdated
meena-at-work and others added 2 commits April 27, 2026 21:46
Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace state_E (not in scope here) with 'the number of local experts'.

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai (bot) left a comment


🧹 Nitpick comments (1)
flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py (1)

142-142: Refresh the stale field-level comment.

The dataclass annotation still says # [state_E] int32, for micro kernel pre-pass, but the buffer is now sized to max(state_E, max_rows). Worth updating to avoid confusing future readers about the actual capacity contract.

📝 Suggested update
-    compact_topk_ids: torch.Tensor  # [state_E] int32, for micro kernel pre-pass
+    compact_topk_ids: torch.Tensor  # [max(state_E, max_rows)] int32, for micro kernel pre-pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py` at line 142,
Update the stale field comment for the dataclass field compact_topk_ids to
reflect its current capacity contract: note that the buffer is sized to
max(state_E, max_rows) (still int32) and clarify it's used for the micro-kernel
pre-pass / compact top-k indices; locate the compact_topk_ids declaration in
moe_dispatch.py and replace the old “[state_E] int32, for micro kernel pre-pass”
comment with a concise comment such as “# [max(state_E, max_rows)] int32, for
micro-kernel pre-pass (compact top-k indices)”.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 060e40a7-e392-4ab9-a9ef-eb8b9cf351c6

📥 Commits

Reviewing files that changed from the base of the PR and between d0d7a10 and 27d763e.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/cute_dsl/blackwell_sm12x/moe_dispatch.py

coderabbitai (bot) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py`:
- Around line 71-72: The docstring describing the expected size for
weight_expert_ids is stale: it currently claims weight_expert_ids must be length
>= total_pairs but the code now only writes indices 0..active_expert_count-1
bounded by the number of local experts. Update the docstring near the
weight_expert_ids parameter (the docstring around line ~62 in triton_compact.py)
to state that weight_expert_ids needs to be sized to accommodate
active_expert_count (or the number of local experts) rather than total_pairs,
and clarify that no full total_pairs-sized buffer is required because writes are
limited to 0..active_expert_count-1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 671bfcab-f71e-4f41-aab4-30ec6f6e645b

📥 Commits

Reviewing files that changed from the base of the PR and between 27d763e and fb0870d.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/cute_dsl/blackwell_sm12x/triton_compact.py

Comment on lines +71 to +72
# weight_expert_ids writes at indices 0..active_expert_count-1 (bounded by
# the number of local experts, not total_pairs), so no size check is needed here.


⚠️ Potential issue | 🟡 Minor

Docstring contract is now stale after removing the size check.

Line 62 still says weight_expert_ids must be [>=total_pairs], but Lines 71-72 explicitly relax that. Please update the docstring to reflect the new expected sizing contract.

Suggested doc fix
-        weight_expert_ids: [>=total_pairs] int32 — output: local->global map.
+        weight_expert_ids: int32 — output: local->global map; size must cover
+            the maximum number of unique experts expected in `topk_ids`.

…um_local_experts

Adds test_micro_pairs_exceed_local_experts with three configurations where
num_tokens * top_k > num_local_experts, directly exercising the bug fixed
by sizing compact_topk_ids as max(state_E, max_rows).

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai (bot) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/moe/test_b12x_fused_moe.py`:
- Around line 1070-1140: The new regression test function
test_micro_pairs_exceed_local_experts in the tests/moe/test_b12x_fused_moe.py
file was auto-reformatted by ruff-format in CI; commit the formatter's output so
the repo matches CI (prevent pre-commit/CI failures). Run ruff-format (or your
project's formatting command) on the changed hunk containing
test_micro_pairs_exceed_local_experts and add the resulting changes to the same
patch/PR, then re-run tests to ensure the committed formatting is the only
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a81dfaeb-9bb9-4079-a832-e8c5691c6723

📥 Commits

Reviewing files that changed from the base of the PR and between fb0870d and 224c441.

📒 Files selected for processing (1)
  • tests/moe/test_b12x_fused_moe.py

Comment thread tests/moe/test_b12x_fused_moe.py
kahyunnam (Member):

/bot run

flashinfer-bot (Collaborator):

GitLab MR !626 has been created, and the CI pipeline #50264407 is currently running. I'll report back once the pipeline job completes.

aleozlx (Collaborator) commented May 4, 2026:

pls address pre-commit

Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coderabbitai (bot) left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/moe/test_b12x_fused_moe.py (1)

1070-1077: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Outstanding pre-commit formatting failure — ruff-format still needs to be committed.

A previous CI run shows ruff-format auto-modified this exact hunk and the reformatted version was not committed. This will continue to fail the pre-commit gate on every subsequent CI run until the formatter output is checked in.

Run pre-commit run --all-files (or ruff format tests/moe/test_b12x_fused_moe.py) locally and commit the result.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_b12x_fused_moe.py` around lines 1070 - 1077, The failing
pre-commit check is due to uncommitted changes from ruff-format for the
parametrized test decorator; run the formatter and commit the updated hunk.
Locally run `pre-commit run --all-files` or `ruff format
tests/moe/test_b12x_fused_moe.py` to apply ruff formatting to the
pytest.mark.parametrize block (the decorator specifying
"num_tokens,top_k,num_experts"), verify the diff, and commit the formatted file
so the pre-commit gate passes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/moe/test_b12x_fused_moe.py`:
- Around line 1073-1076: The test case tuple (4, 8, 16) in the parameter list
can produce routed_rows=32 which may not hit the micro-kernel path; replace that
tuple with (2, 8, 16) so routed_rows becomes 16 (< typical micro_cutover) to
reliably exercise the micro-kernel path in the test (update the tuple in the
list of test cases in test_b12x_fused_moe.py).


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52077f96-6d43-45a9-b275-702ce81d6c77

📥 Commits

Reviewing files that changed from the base of the PR and between 224c441 and 4e76519.

📒 Files selected for processing (1)
  • tests/moe/test_b12x_fused_moe.py

Comment on lines +1073 to +1076
(2, 8, 8), # total_pairs=16 > num_local_experts=8
(4, 8, 16), # total_pairs=32 > num_local_experts=16
(4, 4, 8), # total_pairs=16 > num_local_experts=8
],


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Test case (4, 8, 16) may not reliably exercise the micro-kernel path.

routed_rows = num_tokens * top_k = 4 * 8 = 32. The micro-kernel is selected only when routed_rows <= micro_cutover, which the PR describes as "typically 20–40". If micro_cutover is 20 for the target hardware/configuration, this case silently falls through to the standard path and does not exercise the regression being fixed.

Consider replacing (4, 8, 16) with a case whose routed_rows is safely within the guaranteed micro cutover range — e.g. (2, 8, 16) gives routed_rows=16 < 20.

🔧 Suggested replacement
-            (4, 8, 16),  # total_pairs=32 > num_local_experts=16
+            (2, 8, 16),  # total_pairs=16 > num_local_experts=16, routed_rows safely within micro cutover
📝 Committable suggestion (‼️ IMPORTANT: carefully review before committing; ensure it accurately replaces the highlighted code, has correct indentation, and passes tests)

             (2, 8, 8),  # total_pairs=16 > num_local_experts=8
-            (4, 8, 16),  # total_pairs=32 > num_local_experts=16
+            (2, 8, 16),  # total_pairs=16 > num_local_experts=16, routed_rows safely within micro cutover
             (4, 4, 8),  # total_pairs=16 > num_local_experts=8
         ],

@kahyunnam kahyunnam enabled auto-merge (squash) May 6, 2026 22:02
@kahyunnam kahyunnam added the v0.6.11 release blocker label for 0.6.11 label May 6, 2026
@aleozlx aleozlx added the run-ci label May 6, 2026
@kahyunnam kahyunnam merged commit 14f2bee into flashinfer-ai:main May 7, 2026
83 of 102 checks passed

Labels

op: moe run-ci v0.6.11 release blocker label for 0.6.11


4 participants