consolidate behavioud of routing in scattermoe kernels#3475

Merged
winglian merged 6 commits into main from scattermoe-route-opts on Mar 17, 2026

Conversation

@winglian

@winglian winglian commented Mar 7, 2026

Collaborator

Description

Match sonicmoe's behaviour for softmax and sigmoid routing in scattermoe.
Capture metrics for the best kernels selected by scattermoe autotuning so the candidate configurations can be pruned more effectively later.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Claude

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • Refactor

    • Internal kernel code restructured with centralized helper functions for routing logic and expert computations, reducing code complexity while supporting multiple routing schemes and LoRA configurations.
  • Tests

    • Comprehensive test coverage added for mixture-of-experts kernels, including multiple routing strategies (softmax and sigmoid variants), shared expert scenarios, and reference implementations for validation against actual performance.

@winglian winglian requested a review from NanoCode012 March 7, 2026 05:29
@coderabbitai

coderabbitai Bot commented Mar 7, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 795da081-357b-4083-859d-a17955f642b1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR refactors the scattermoe-lora kernel integration by introducing centralized routing helper functions (_softmax_topk_route, _sigmoid_topk_route, _route dispatcher) and a shared-expert computation helper (_compute_shared_expert). The HFScatterMoEGatedMLP forward path is updated to use these helpers, reducing in-method complexity. Comprehensive end-to-end and unit tests are added to validate sigmoid and softmax routing variants.
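To make the dispatcher idea concrete, here is a minimal sketch of how a `_route`-style function might choose between the two schemes. It assumes, based only on the review comments further down, that the presence of `e_score_correction_bias` on the gate or on the block is what selects sigmoid routing. The names `find_score_bias` and `choose_routing` are hypothetical and are not taken from the PR; plain lists stand in for tensors.

```python
from types import SimpleNamespace


def find_score_bias(gate, moe_block):
    """Look for e_score_correction_bias on the gate first, then on the block."""
    bias = getattr(gate, "e_score_correction_bias", None)
    if bias is None:
        bias = getattr(moe_block, "e_score_correction_bias", None)
    return bias


def choose_routing(gate, moe_block):
    """Assumed rule (not confirmed by the PR): a score bias implies sigmoid routing."""
    if find_score_bias(gate, moe_block) is not None:
        return "sigmoid"
    return "softmax"


# Plain lists stand in for tensors so the sketch has no dependencies.
bias_on_gate = SimpleNamespace(
    gate=SimpleNamespace(e_score_correction_bias=[0.0, 0.0])
)
bias_on_block = SimpleNamespace(
    gate=SimpleNamespace(), e_score_correction_bias=[0.0, 0.0]
)
no_bias = SimpleNamespace(gate=SimpleNamespace())

for label, block in [("gate", bias_on_gate), ("block", bias_on_block), ("none", no_bias)]:
    print(label, choose_routing(block.gate, block))
```

Keeping the lookup in one helper means the dispatcher and the sigmoid path cannot disagree about where the bias lives.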

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kernel Implementation: src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | Introduced routing helper functions (_softmax_topk_route, _sigmoid_topk_route) and a dispatcher (_route) to centralize routing logic. Implemented _compute_shared_expert helper for shared expert handling. Refactored HFScatterMoEGatedMLP forward path to delegate routing and shared-expert computations to these helpers, reducing complexity. |
| End-to-End Tests: tests/e2e/integrations/test_scattermoe_lora_kernels.py | Added comprehensive end-to-end tests including _reference_moe_forward reference implementation and _make_mock_sigmoid_moe_block factory. Introduced TestHFScatterMoESigmoidRouting and TestHFScatterMoESigmoidWithSharedExperts test classes validating sigmoid/softmax routing variants, grouped/ungrouped routing, and shared-expert scenarios. |
| Unit Test Expansion: tests/integrations/test_scattermoe_lora.py | Expanded test coverage with routing strategy detection and routing tests for both the softmax (Qwen/OLMoE) and sigmoid (GLM/DeepSeek) paths. Added tests for sigmoid top-k routing properties, _route dispatcher logic, and generic shared expert handling with various attribute names and gating behaviors. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #3474: Directly targets the same scattermoe-lora kernel and test files with comprehensive GPU correctness validation for sigmoid/softmax routing and LoRA behaviors.
  • #3411: Introduces similar softmax-topk and sigmoid-topk routing helpers and centralized MoE kernel integration patterns.

Suggested labels

ready to merge, scheduled_release

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'consolidate behavioud of routing in scattermoe kernels' contains a typo ('behavioud' should be 'behavior') and refers to consolidating routing, which matches the PR's refactoring of routing logic into helper functions. However, the typo reduces clarity. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 81.08% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (4)
tests/integrations/test_scattermoe_lora.py (1)

369-380: Consider adding missing attributes to the bias_on_gate=False mock.

When bias_on_gate=False, the mock moe_block is missing n_routed_experts, n_group, norm_topk_prob, and routed_scaling_factor. While _sigmoid_topk_route handles these with getattr defaults, explicitly setting them would make the test more representative of real model structures and exercise more code paths consistently.

🧪 Suggested enhancement
     else:
         # minimax_m2 style: bias on block, not gate
         gate = SimpleNamespace(
             weight=torch.randn(E, H),
             top_k=K,
         )
         moe_block = SimpleNamespace(
             gate=gate,
             top_k=K,
             e_score_correction_bias=torch.zeros(E),
+            n_routed_experts=E,
+            n_group=1,
+            norm_topk_prob=True,
+            routed_scaling_factor=1.0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/test_scattermoe_lora.py` around lines 369 - 380, The mock
created for the bias_on_gate=False branch lacks several attributes present on
real blocks; update the SimpleNamespace `moe_block` (and/or `gate`) in that
branch to include `n_routed_experts`, `n_group`, `norm_topk_prob`, and
`routed_scaling_factor` with sensible default tensors/scalars (e.g., ints or
torch tensors matching expected shapes) so tests exercise the same code paths as
real models and avoid relying on getattr defaults in `_sigmoid_topk_route`.
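The point about `getattr` defaults can be illustrated with a small sketch. It compares a sparse mock with a fully populated one and shows that both resolve to the same routing configuration when the defaults match the explicit values. The helper `routing_config` is hypothetical, the defaults for `norm_topk_prob` and `routed_scaling_factor` are assumptions chosen to mirror the suggested enhancement, and plain lists stand in for tensors.

```python
from types import SimpleNamespace

E, K = 8, 2  # number of experts and top-k, arbitrary values for illustration


def routing_config(moe_block):
    """Resolve routing options the way a getattr-with-default lookup would."""
    return {
        "n_group": getattr(moe_block, "n_group", 1),
        # The two defaults below are assumptions, not confirmed by the PR.
        "norm_topk_prob": getattr(moe_block, "norm_topk_prob", True),
        "routed_scaling_factor": getattr(moe_block, "routed_scaling_factor", 1.0),
    }


# Sparse mock: only what the bias_on_gate=False branch sets today.
sparse_mock = SimpleNamespace(top_k=K, e_score_correction_bias=[0.0] * E)

# Complete mock: the same block with the suggested attributes spelled out.
full_mock = SimpleNamespace(
    top_k=K,
    e_score_correction_bias=[0.0] * E,
    n_routed_experts=E,
    n_group=1,
    norm_topk_prob=True,
    routed_scaling_factor=1.0,
)

print(routing_config(sparse_mock) == routing_config(full_mock))
```

Both mocks behave identically here, which is why the existing tests pass. The explicit version simply stops the test from silently depending on whatever the defaults happen to be.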
tests/e2e/integrations/test_scattermoe_lora_kernels.py (2)

1489-1537: Unused parameters in reference implementation.

The gate_weight and num_experts parameters (flagged by static analysis) are unused. The routing decision is already encoded in routing_weights/selected_experts, and num_experts can be inferred from gate_up_proj.shape[0]. Consider removing them or prefixing with underscore if kept for API consistency.

🧹 Suggested cleanup
 def _reference_moe_forward(
     hidden_states,
-    gate_weight,
     gate_up_proj,
     down_proj,
     act_fn,
     routing_weights,
     selected_experts,
-    num_experts,
 ):

Then update call sites accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 1489 -
1537, The _reference_moe_forward function currently has unused parameters
gate_weight and num_experts; remove these parameters from the signature and all
call sites (or if you must keep them for API compatibility, rename them to
_gate_weight and _num_experts to mark as intentionally unused) and derive number
of experts from gate_up_proj.shape[0] where needed; update any calls to
_reference_moe_forward to match the new signature or continue passing the
arguments if renamed (keeping callers consistent with the change).
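As a rough illustration of the trimmed signature, the sketch below implements a dense reference forward in plain Python and infers the expert count from the leading dimension of `gate_up_proj`. It is not the PR's `_reference_moe_forward`: nested lists stand in for tensors, and the layout of `gate_up_proj` (gate rows followed by up rows) is an assumption made for this example.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def reference_moe_forward(
    hidden_states, gate_up_proj, down_proj, act_fn, routing_weights, selected_experts
):
    """Dense per-token loop over the selected experts."""
    num_experts = len(gate_up_proj)  # inferred instead of passed in
    outputs = []
    for x, weights, experts in zip(hidden_states, routing_weights, selected_experts):
        out = [0.0] * len(x)
        for w, e in zip(weights, experts):
            assert 0 <= e < num_experts
            fused = matvec(gate_up_proj[e], x)  # assumed layout: [gate rows; up rows]
            half = len(fused) // 2
            gate, up = fused[:half], fused[half:]
            inner = [act_fn(g) * u for g, u in zip(gate, up)]
            y = matvec(down_proj[e], inner)
            out = [o + w * yi for o, yi in zip(out, y)]
        outputs.append(out)
    return outputs


# Two experts, hidden size 2, intermediate size 1, identity activation.
gate_up = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
down = [[[1.0], [0.0]], [[0.0], [1.0]]]
result = reference_moe_forward(
    [[2.0, 3.0]], gate_up, down, lambda v: v, [[0.25, 0.75]], [[0, 1]]
)
print(result)
```

Because routing has already produced `routing_weights` and `selected_experts`, the reference needs neither the gate weight nor an explicit expert count.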

1570-1583: Mock for bias_on_gate=False case is incomplete but works for current tests.

The minimax_m2 style mock (lines 1570-1581) is missing several attributes (n_routed_experts, n_group, norm_topk_prob, routed_scaling_factor). This works because the test at line 1640 uses n_group=1 and _sigmoid_topk_route handles missing attributes with getattr defaults. Consider adding these for consistency with the bias_on_gate=True case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 1570 -
1583, The minimax_m2-style mock in the test creates gate and moe_block
SimpleNamespace objects but omits attributes expected in the bias_on_gate=True
case; add the missing attributes n_routed_experts, n_group, norm_topk_prob, and
routed_scaling_factor to the moe_block (and any needed defaults on gate) so the
mock mirrors the other branch; update the SimpleNamespace construction for
gate/moe_block used in test_scattermoe_lora_kernels.py (the minimax_m2 mock) to
include these attributes with sensible defaults (e.g., n_group=1 and zeros/ones
as appropriate) so code paths like _sigmoid_topk_route see consistent fields.
src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py (1)

228-248: Unused moe_block parameter is acceptable for API consistency.

The moe_block parameter is unused here (as flagged by static analysis) but provides API symmetry with _sigmoid_topk_route, allowing the _route dispatcher to call both functions with the same signature. Consider prefixing with underscore to silence the linter.

🔧 Suggested fix to silence linter
 def _softmax_topk_route(
-    moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
+    _moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):

Note: You'd also need to update the call site in _route to pass the argument positionally or update the parameter name there.
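The call-site caveat is easy to reproduce in isolation. In this sketch the two functions are hypothetical stand-ins for the router before and after the rename; only the parameter name differs.

```python
def softmax_route_old(moe_block, logits):
    """Stand-in router: returns the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def softmax_route_new(_moe_block, logits):
    """Same body, with the unused parameter renamed to silence the linter."""
    return max(range(len(logits)), key=logits.__getitem__)


block = object()
logits = [0.1, 0.9, 0.3]

# Positional calls are unaffected by the rename.
positional_result = softmax_route_new(block, logits)

# A caller that still passes moe_block= by keyword breaks.
try:
    softmax_route_new(moe_block=block, logits=logits)
    keyword_call_ok = True
except TypeError:
    keyword_call_ok = False

print(positional_result, keyword_call_ok)
```

If the dispatcher passes the block positionally, the rename is safe. If it uses a keyword argument, both sides have to change together.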

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` around lines
228 - 248, The unused parameter moe_block in _softmax_topk_route should be
renamed with a leading underscore (e.g., _moe_block) to silence the linter while
keeping API symmetry with _sigmoid_topk_route; update the function signature in
_softmax_topk_route and ensure the dispatcher _route still passes the argument
correctly (either positionally or by matching the new name) so call sites remain
consistent.
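For readers unfamiliar with the softmax path, the following is a minimal single-token sketch of softmax top-k routing in plain Python. It is not the PR's `_softmax_topk_route`: the function name, the `renormalize` flag, and the use of precomputed logits (rather than `hidden_states`, `gate_weight`, and `gate_lora_delta`) are simplifying assumptions.

```python
import math


def softmax_topk_route(logits, top_k, renormalize=False):
    """Return (expert_indices, expert_weights) for one token."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Pick the top_k experts by probability, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in chosen]

    if renormalize:
        # Optional: make the selected weights sum to 1.
        selected_sum = sum(weights)
        weights = [w / selected_sum for w in weights]
    return chosen, weights


experts, weights = softmax_topk_route([1.0, 3.0, 2.0, 0.0], top_k=2)
_, normed = softmax_topk_route([1.0, 3.0, 2.0, 0.0], top_k=2, renormalize=True)
print(experts, sum(weights) < 1.0, abs(sum(normed) - 1.0) < 1e-9)
```

Note that nothing in the computation needs the block object, which is consistent with the linter flagging `moe_block` as unused in the softmax helper.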
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py`:
- Around line 275-300: If e_score_correction_bias is missing, avoid adding None
to router_probs by defaulting e_score_correction_bias to a zero tensor with the
same shape/device/dtype as router_probs (use getattr on base_gate and moe_block
as already done, then if None create torch.zeros_like(router_probs)); also guard
access to moe_block.topk_group by using getattr(moe_block, "topk_group", 1) (and
optionally validate it's an int >0) before using it in the group selection logic
so _sigmoid_topk_route / scores_for_choice won't raise TypeError/AttributeError
when moe_block is incomplete.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4437119c-ec65-4b85-990f-4375371326b5

📥 Commits

Reviewing files that changed from the base of the PR and between a36aaa7 and 0b1e02a.

📒 Files selected for processing (3)
  • src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
  • tests/e2e/integrations/test_scattermoe_lora_kernels.py
  • tests/integrations/test_scattermoe_lora.py

Comment on lines +275 to +300
    # Bias-corrected scores for expert selection (not used for final weights).
    # glm_moe_dsa/deepseek_v3 store the bias on gate; minimax_m2 on the block.
    e_score_correction_bias = getattr(base_gate, "e_score_correction_bias", None)
    if e_score_correction_bias is None:
        e_score_correction_bias = getattr(moe_block, "e_score_correction_bias", None)
    scores_for_choice = router_probs + e_score_correction_bias

    # Group-based selection: pick top groups, mask the rest
    n_group = getattr(moe_block, "n_group", 1)
    if n_group > 1:
        group_scores = (
            scores_for_choice.view(-1, n_group, num_experts // n_group)
            .topk(2, dim=-1)[0]
            .sum(dim=-1)
        )  # [T, n_group]
        group_idx = torch.topk(
            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
        )[1]
        group_mask = torch.zeros_like(group_scores)
        group_mask.scatter_(1, group_idx, 1)
        score_mask = (
            group_mask.unsqueeze(-1)
            .expand(-1, n_group, num_experts // n_group)
            .reshape(-1, num_experts)
        )
        scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)
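The group-selection logic quoted above can be traced by hand with a single-token, plain-Python sketch. The bias-corrected selection, the top-2 group score, and the zero-fill of unselected groups follow the quoted code. The steps after the mask (taking final weights from the uncorrected probabilities, optional normalisation, and the scaling factor) are assumptions about the part of the function that is not shown here.

```python
import math


def sigmoid_topk_route(
    logits, bias, top_k, n_group=1, topk_group=1,
    norm_topk_prob=True, routed_scaling_factor=1.0,
):
    """Route one token; returns (expert_indices, expert_weights)."""
    num_experts = len(logits)
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Bias-corrected scores are used only to choose experts.
    scores = [p + b for p, b in zip(probs, bias)]

    if n_group > 1:
        size = num_experts // n_group
        # Score each group by the sum of its two best experts.
        group_scores = [
            sum(sorted(scores[g * size:(g + 1) * size], reverse=True)[:2])
            for g in range(n_group)
        ]
        ranked = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)
        keep = set(ranked[:topk_group])
        # Experts in unselected groups get a score of 0.0, as in the quoted code.
        scores = [s if i // size in keep else 0.0 for i, s in enumerate(scores)]

    chosen = sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: final weights come from the uncorrected probabilities.
    weights = [probs[i] for i in chosen]
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    return chosen, [w * routed_scaling_factor for w in weights]


logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, 3.0, -3.0]
chosen, weights = sigmoid_topk_route(
    logits, bias=[0.0] * 8, top_k=2, n_group=2, topk_group=1
)
print(chosen, weights)
```

In this input, expert 6 has the highest individual score, but its group loses on the top-2 sum, so the token is routed to experts 0 and 1 instead. That is the behaviour the group mask is there to produce.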

⚠️ Potential issue | 🟡 Minor

Potential runtime errors if called with incomplete moe_block.

Two concerns:

  1. Line 280: If e_score_correction_bias is None (both getattr lookups return None), this line will raise TypeError: unsupported operand type(s) for +: 'Tensor' and 'NoneType'. While _route guards against this, direct calls to _sigmoid_topk_route could crash.

  2. Line 291: moe_block.topk_group is accessed directly without getattr. If n_group > 1 but topk_group attribute is missing, this raises AttributeError.

🛡️ Proposed defensive fix
     e_score_correction_bias = getattr(base_gate, "e_score_correction_bias", None)
     if e_score_correction_bias is None:
         e_score_correction_bias = getattr(moe_block, "e_score_correction_bias", None)
+    if e_score_correction_bias is None:
+        raise ValueError(
+            "_sigmoid_topk_route requires e_score_correction_bias on gate or moe_block"
+        )
     scores_for_choice = router_probs + e_score_correction_bias
 
     # Group-based selection: pick top groups, mask the rest
     n_group = getattr(moe_block, "n_group", 1)
     if n_group > 1:
         group_scores = (
             scores_for_choice.view(-1, n_group, num_experts // n_group)
             .topk(2, dim=-1)[0]
             .sum(dim=-1)
         )  # [T, n_group]
+        topk_group = getattr(moe_block, "topk_group", 1)
         group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+            group_scores, k=topk_group, dim=-1, sorted=False
         )[1]
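The effect of the proposed guard can be sketched without torch. The helpers `require_score_bias` and `get_topk_group` below are hypothetical names that mirror the two changes in the diff: an explicit `ValueError` when no bias is found, and a `getattr` default for `topk_group`. Plain lists stand in for tensors.

```python
from types import SimpleNamespace


def require_score_bias(gate, moe_block):
    """Return the score bias, or fail with a clear message if it is missing."""
    bias = getattr(gate, "e_score_correction_bias", None)
    if bias is None:
        bias = getattr(moe_block, "e_score_correction_bias", None)
    if bias is None:
        raise ValueError(
            "_sigmoid_topk_route requires e_score_correction_bias on gate or moe_block"
        )
    return bias


def get_topk_group(moe_block):
    """Fall back to a single group when topk_group is not set."""
    return getattr(moe_block, "topk_group", 1)


incomplete = SimpleNamespace(gate=SimpleNamespace(), n_group=4)

try:
    require_score_bias(incomplete.gate, incomplete)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(get_topk_group(incomplete))
```

Raising early names the missing attribute, whereas the unguarded addition would surface later as a generic `TypeError` with no hint about which object was incomplete.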

@codecov

codecov Bot commented Mar 7, 2026

Codecov Report

❌ Patch coverage is 88.81579% with 17 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../axolotl/integrations/kernels/autotune_callback.py | 75.51% | 12 Missing ⚠️ |
| ...ntegrations/kernels/libs/scattermoe_lora/layers.py | 94.64% | 3 Missing ⚠️ |
| ...axolotl/integrations/kernels/autotune_collector.py | 95.12% | 2 Missing ⚠️ |
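The percentages and missing-line counts in this report are enough to back out the approximate number of changed lines. The short sketch below does that arithmetic. The helper name is hypothetical, and the results are inferences from the rounded percentages, not figures stated by Codecov.

```python
def implied_total_lines(missing, coverage_percent):
    """Estimate total changed lines from a missing count and a coverage percentage."""
    return round(missing / (1.0 - coverage_percent / 100.0))


# (missing lines, patch coverage %) as listed in the report above.
report = {
    "autotune_callback.py": (12, 75.51),
    "scattermoe_lora/layers.py": (3, 94.64),
    "autotune_collector.py": (2, 95.12),
}

per_file = {
    name: implied_total_lines(missing, pct) for name, (missing, pct) in report.items()
}
patch_total = implied_total_lines(17, 88.81579)
listed_total = sum(per_file.values())

print(per_file)
print(patch_total, patch_total - listed_total)
```

The implied totals are 49, 56, and 41 lines for the three files and 152 lines for the whole patch. The remaining 6 lines would then belong to files with no missing coverage, assuming the table lists only files that have uncovered lines.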

Comment thread src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Comment thread tests/integrations/test_scattermoe_lora.py
@winglian winglian merged commit 8f3fb51 into main Mar 17, 2026
29 of 30 checks passed
@winglian winglian deleted the scattermoe-route-opts branch March 17, 2026 03:47