consolidate behavioud of routing in scattermoe kernels by winglian · Pull Request #3475 · axolotl-ai-cloud/axolotl

winglian · 2026-03-07T05:29:14Z

Description

- Match sonicmoe's behaviour for softmax and sigmoid routing in scattermoe.
- Capture the metrics of the best kernel chosen by scattermoe autotuning so the autotuning candidates can be pruned better later.

AI Usage Disclaimer

Claude

Summary by CodeRabbit

Refactor
- Internal kernel code restructured with centralized helper functions for routing logic and expert computations, reducing code complexity while supporting multiple routing schemes and LoRA configurations.
Tests
- Comprehensive test coverage added for mixture-of-experts kernels, including multiple routing strategies (softmax and sigmoid variants), shared expert scenarios, and reference implementations for validation against actual performance.

coderabbitai · 2026-03-07T05:29:31Z

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 795da081-357b-4083-859d-a17955f642b1

Walkthrough

This PR refactors the scattermoe-lora kernel integration by introducing centralized routing helper functions (_softmax_topk_route, _sigmoid_topk_route, _route dispatcher) and a shared-expert computation helper (_compute_shared_expert). The HFScatterMoEGatedMLP forward path is updated to use these helpers, reducing in-method complexity. Comprehensive end-to-end and unit tests are added to validate sigmoid and softmax routing variants.
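
As a rough illustration of the helper split described in the walkthrough, the sketch below routes a single token in pure Python. The function bodies, the list-based inputs, and the rule that the dispatcher takes the sigmoid path whenever an `e_score_correction_bias` is found are assumptions made for this example; the PR's helpers operate on tensors and take different arguments.

```python
import math
from types import SimpleNamespace


def softmax_topk_route(logits, top_k):
    """Softmax over all experts, then keep the top_k largest probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return [probs[i] for i in chosen], chosen


def sigmoid_topk_route(logits, top_k, bias):
    """Sigmoid per expert; the bias affects which experts win, not their weights."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    scores = [p + b for p, b in zip(probs, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [probs[i] for i in chosen], chosen


def route(moe_block, logits):
    """Hypothetical dispatcher: look for the bias on the gate, then on the block."""
    bias = getattr(moe_block.gate, "e_score_correction_bias", None)
    if bias is None:
        bias = getattr(moe_block, "e_score_correction_bias", None)
    if bias is None:
        return softmax_topk_route(logits, moe_block.top_k)
    return sigmoid_topk_route(logits, moe_block.top_k, bias)


# Softmax-style block: no correction bias anywhere.
softmax_block = SimpleNamespace(gate=SimpleNamespace(), top_k=2)
softmax_weights, softmax_experts = route(softmax_block, [0.0, 2.0, 1.0])

# Sigmoid-style block: a large bias on expert 2 wins the selection.
sigmoid_block = SimpleNamespace(
    gate=SimpleNamespace(e_score_correction_bias=[0.0, 0.0, 5.0]), top_k=1
)
sigmoid_weights, sigmoid_experts = route(sigmoid_block, [3.0, 0.0, 0.0])
```

In the second call, expert 2 is selected because of its bias, yet its returned weight is the unbiased sigmoid of its logit.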

Changes

| Cohort / File(s) | Summary |
|---|---|
| Kernel Implementation `src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` | Introduced routing helper functions (_softmax_topk_route, _sigmoid_topk_route) and a dispatcher (_route) to centralize routing logic. Implemented _compute_shared_expert helper for shared expert handling. Refactored HFScatterMoEGatedMLP forward path to delegate routing and shared-expert computations to these helpers, reducing complexity. |
| End-to-End Tests `tests/e2e/integrations/test_scattermoe_lora_kernels.py` | Added comprehensive end-to-end tests including _reference_moe_forward reference implementation and _make_mock_sigmoid_moe_block factory. Introduced TestHFScatterMoESigmoidRouting and TestHFScatterMoESigmoidWithSharedExperts test classes validating sigmoid/softmax routing variants, grouped/ungrouped routing, and shared-expert scenarios. |
| Unit Test Expansion `tests/integrations/test_scattermoe_lora.py` | Expanded test coverage with routing strategy detection and routing tests for both the softmax (Qwen/OLMoE) and sigmoid (GLM/DeepSeek) paths. Added tests for sigmoid top-k routing properties, _route dispatcher logic, and generic shared expert handling with various attribute names and gating behaviors. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

#3474: Directly targets the same scattermoe-lora kernel and test files with comprehensive GPU correctness validation for sigmoid/softmax routing and LoRA behaviors.
#3411: Introduces similar softmax-topk and sigmoid-topk routing helpers and centralized MoE kernel integration patterns.

Suggested labels

ready to merge, scheduled_release

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'consolidate behavioud of routing in scattermoe kernels' contains a typo ('behavioud' should be 'behavior') and refers to consolidating routing, which matches the PR's refactoring of routing logic into helper functions. However, the typo reduces clarity. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 81.08% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

tests/integrations/test_scattermoe_lora.py (1)
369-380: Consider adding missing attributes to the bias_on_gate=False mock.

When bias_on_gate=False, the mock moe_block is missing n_routed_experts, n_group, norm_topk_prob, and routed_scaling_factor. While _sigmoid_topk_route handles these with getattr defaults, explicitly setting them would make the test more representative of real model structures and exercise more code paths consistently.
🧪 Suggested enhancement
     else:
         # minimax_m2 style: bias on block, not gate
         gate = SimpleNamespace(
             weight=torch.randn(E, H),
             top_k=K,
         )
         moe_block = SimpleNamespace(
             gate=gate,
             top_k=K,
             e_score_correction_bias=torch.zeros(E),
+            n_routed_experts=E,
+            n_group=1,
+            norm_topk_prob=True,
+            routed_scaling_factor=1.0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/test_scattermoe_lora.py` around lines 369 - 380, The mock
created for the bias_on_gate=False branch lacks several attributes present on
real blocks; update the SimpleNamespace `moe_block` (and/or `gate`) in that
branch to include `n_routed_experts`, `n_group`, `norm_topk_prob`, and
`routed_scaling_factor` with sensible default tensors/scalars (e.g., ints or
torch tensors matching expected shapes) so tests exercise the same code paths as
real models and avoid relying on getattr defaults in `_sigmoid_topk_route`.
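
To make the point about `getattr` defaults concrete, here is a small sketch of how a sparse mock and a fully specified mock can end up with the same routing settings. The helper `read_sigmoid_routing_config` and the default values other than `n_group=1` are hypothetical and chosen only for illustration; they are not taken from `_sigmoid_topk_route`.

```python
from types import SimpleNamespace


def read_sigmoid_routing_config(moe_block, num_experts):
    """Collect routing settings, falling back to defaults for absent attributes."""
    return {
        "n_routed_experts": getattr(moe_block, "n_routed_experts", num_experts),
        "n_group": getattr(moe_block, "n_group", 1),
        # The two defaults below are assumptions for this example.
        "norm_topk_prob": getattr(moe_block, "norm_topk_prob", True),
        "routed_scaling_factor": getattr(moe_block, "routed_scaling_factor", 1.0),
    }


E = 8  # number of experts in the toy mock

# Sparse mock: relies entirely on the getattr fallbacks.
sparse_block = SimpleNamespace(top_k=2)

# Explicit mock: states every attribute, as the review suggests.
explicit_block = SimpleNamespace(
    top_k=2,
    n_routed_experts=E,
    n_group=1,
    norm_topk_prob=True,
    routed_scaling_factor=1.0,
)

sparse_config = read_sigmoid_routing_config(sparse_block, E)
explicit_config = read_sigmoid_routing_config(explicit_block, E)
```

Both mocks yield the same settings here, which is why the sparse mock passes. The explicit mock has the advantage that a later change to a fallback value would not silently alter what the test exercises.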
tests/e2e/integrations/test_scattermoe_lora_kernels.py (2)
1489-1537: Unused parameters in reference implementation.

The gate_weight and num_experts parameters (flagged by static analysis) are unused. The routing decision is already encoded in routing_weights/selected_experts, and num_experts can be inferred from gate_up_proj.shape[0]. Consider removing them or prefixing with underscore if kept for API consistency.
🧹 Suggested cleanup
 def _reference_moe_forward(
     hidden_states,
-    gate_weight,
     gate_up_proj,
     down_proj,
     act_fn,
     routing_weights,
     selected_experts,
-    num_experts,
 ):
Then update call sites accordingly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 1489 -
1537, The _reference_moe_forward function currently has unused parameters
gate_weight and num_experts; remove these parameters from the signature and all
call sites (or if you must keep them for API compatibility, rename them to
_gate_weight and _num_experts to mark as intentionally unused) and derive number
of experts from gate_up_proj.shape[0] where needed; update any calls to
_reference_moe_forward to match the new signature or continue passing the
arguments if renamed (keeping callers consistent with the change).
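
A reference forward pass with the trimmed signature could look like the pure-Python sketch below. It is not the test file's `_reference_moe_forward`: the nested-list weight layout (gate rows first, then up rows), the helper names, and the per-token loop are assumptions made so the example runs without a tensor library. It does show the expert count being read from the first dimension of `gate_up_proj`.

```python
import math


def silu(x):
    """SiLU activation, x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def reference_moe_forward(
    hidden_states, gate_up_proj, down_proj, act_fn, routing_weights, selected_experts
):
    """Per-token loop over the selected experts; slow but easy to check by hand.

    Assumed layout: gate_up_proj[e] has 2*I rows of length H (gate rows, then
    up rows) and down_proj[e] has H rows of length I.
    """
    num_experts = len(gate_up_proj)  # inferred, so no num_experts argument
    outputs = []
    for token, weights, experts in zip(hidden_states, routing_weights, selected_experts):
        out = [0.0] * len(token)
        for weight, expert in zip(weights, experts):
            assert 0 <= expert < num_experts
            gate_up = matvec(gate_up_proj[expert], token)
            half = len(gate_up) // 2
            inner = [act_fn(g) * u for g, u in zip(gate_up[:half], gate_up[half:])]
            expert_out = matvec(down_proj[expert], inner)
            out = [o + weight * y for o, y in zip(out, expert_out)]
        outputs.append(out)
    return outputs


# Toy problem: one token, hidden size 1, intermediate size 1, two experts.
gate_up_proj = [[[1.0], [2.0]], [[0.0], [1.0]]]
down_proj = [[[1.0]], [[3.0]]]
result = reference_moe_forward(
    [[1.0]], gate_up_proj, down_proj, silu, [[0.75, 0.25]], [[0, 1]]
)
```

Expert 1 contributes nothing in this toy case because its gate projection is zero, so the output is 0.75 times expert 0's output.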
1570-1583: Mock for bias_on_gate=False case is incomplete but works for current tests.

The minimax_m2 style mock (lines 1570-1581) is missing several attributes (n_routed_experts, n_group, norm_topk_prob, routed_scaling_factor). This works because the test at line 1640 uses n_group=1 and _sigmoid_topk_route handles missing attributes with getattr defaults. Consider adding these for consistency with the bias_on_gate=True case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/integrations/test_scattermoe_lora_kernels.py` around lines 1570 -
1583, The minimax_m2-style mock in the test creates gate and moe_block
SimpleNamespace objects but omits attributes expected in the bias_on_gate=True
case; add the missing attributes n_routed_experts, n_group, norm_topk_prob, and
routed_scaling_factor to the moe_block (and any needed defaults on gate) so the
mock mirrors the other branch; update the SimpleNamespace construction for
gate/moe_block used in test_scattermoe_lora_kernels.py (the minimax_m2 mock) to
include these attributes with sensible defaults (e.g., n_group=1 and zeros/ones
as appropriate) so code paths like _sigmoid_topk_route see consistent fields.
src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py (1)
228-248: Unused moe_block parameter is acceptable for API consistency.

The moe_block parameter is unused here (as flagged by static analysis) but provides API symmetry with _sigmoid_topk_route, allowing the _route dispatcher to call both functions with the same signature. Consider prefixing with underscore to silence the linter.
🔧 Suggested fix to silence linter
 def _softmax_topk_route(
-    moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
+    _moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
Note: You'd also need to update the call site in _route to pass the argument positionally or update the parameter name there.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` around lines
228 - 248, The unused parameter moe_block in _softmax_topk_route should be
renamed with a leading underscore (e.g., _moe_block) to silence the linter while
keeping API symmetry with _sigmoid_topk_route; update the function signature in
_softmax_topk_route and ensure the dispatcher _route still passes the argument
correctly (either positionally or by matching the new name) so call sites remain
consistent.
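
The caveat about the call site comes down to how Python binds arguments: renaming a parameter is invisible to positional callers but breaks keyword callers. The functions below are simplified stand-ins with made-up bodies, not the PR's helpers.

```python
def softmax_route_before(moe_block, hidden_states):
    """Stand-in for the original signature; moe_block is unused."""
    return len(hidden_states)


def softmax_route_after(_moe_block, hidden_states):
    """Stand-in for the renamed signature that silences the linter."""
    return len(hidden_states)


def call_positionally(route_fn):
    """A dispatcher that passes the block by position works with either name."""
    return route_fn(None, [1.0, 2.0])


def call_by_keyword(route_fn):
    """A dispatcher that passes moe_block= depends on the parameter name."""
    return route_fn(moe_block=None, hidden_states=[1.0, 2.0])


positional_ok = call_positionally(softmax_route_after)
keyword_ok = call_by_keyword(softmax_route_before)

try:
    call_by_keyword(softmax_route_after)
    keyword_breaks = False
except TypeError:
    # The renamed function no longer accepts a moe_block keyword.
    keyword_breaks = True
```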

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4437119c-ec65-4b85-990f-4375371326b5

📥 Commits

Reviewing files that changed from the base of the PR and between a36aaa7 and 0b1e02a.

📒 Files selected for processing (3)

src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
tests/e2e/integrations/test_scattermoe_lora_kernels.py
tests/integrations/test_scattermoe_lora.py

coderabbitai · 2026-03-07T05:35:36Z

+    # Bias-corrected scores for expert selection (not used for final weights).
+    # glm_moe_dsa/deepseek_v3 store the bias on gate; minimax_m2 on the block.
+    e_score_correction_bias = getattr(base_gate, "e_score_correction_bias", None)
+    if e_score_correction_bias is None:
+        e_score_correction_bias = getattr(moe_block, "e_score_correction_bias", None)
+    scores_for_choice = router_probs + e_score_correction_bias
+
+    # Group-based selection: pick top groups, mask the rest
+    n_group = getattr(moe_block, "n_group", 1)
+    if n_group > 1:
+        group_scores = (
+            scores_for_choice.view(-1, n_group, num_experts // n_group)
+            .topk(2, dim=-1)[0]
+            .sum(dim=-1)
+        )  # [T, n_group]
+        group_idx = torch.topk(
+            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+        )[1]
+        group_mask = torch.zeros_like(group_scores)
+        group_mask.scatter_(1, group_idx, 1)
+        score_mask = (
+            group_mask.unsqueeze(-1)
+            .expand(-1, n_group, num_experts // n_group)
+            .reshape(-1, num_experts)
+        )
+        scores_for_choice = scores_for_choice.masked_fill(~score_mask.bool(), 0.0)


⚠️ Potential issue | 🟡 Minor

Potential runtime errors if called with incomplete moe_block.

Two concerns:

- Line 280: If e_score_correction_bias is None (both getattr calls fail), this line will raise TypeError: unsupported operand type(s) for +: 'Tensor' and 'NoneType'. While _route guards against this, direct calls to _sigmoid_topk_route could crash.
- Line 291: moe_block.topk_group is accessed directly without getattr. If n_group > 1 but the topk_group attribute is missing, this raises AttributeError.

🛡️ Proposed defensive fix

     e_score_correction_bias = getattr(base_gate, "e_score_correction_bias", None)
     if e_score_correction_bias is None:
         e_score_correction_bias = getattr(moe_block, "e_score_correction_bias", None)
+    if e_score_correction_bias is None:
+        raise ValueError(
+            "_sigmoid_topk_route requires e_score_correction_bias on gate or moe_block"
+        )
     scores_for_choice = router_probs + e_score_correction_bias

     # Group-based selection: pick top groups, mask the rest
     n_group = getattr(moe_block, "n_group", 1)
     if n_group > 1:
         group_scores = (
             scores_for_choice.view(-1, n_group, num_experts // n_group)
             .topk(2, dim=-1)[0]
             .sum(dim=-1)
         )  # [T, n_group]
+        topk_group = getattr(moe_block, "topk_group", 1)
         group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
+            group_scores, k=topk_group, dim=-1, sorted=False
         )[1]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` around lines
275 - 300, If e_score_correction_bias is missing, avoid adding None to
router_probs by defaulting e_score_correction_bias to a zero tensor with the
same shape/device/dtype as router_probs (use getattr on base_gate and moe_block
as already done, then if None create torch.zeros_like(router_probs)); also guard
access to moe_block.topk_group by using getattr(moe_block, "topk_group", 1) (and
optionally validate it's an int >0) before using it in the group selection logic
so _sigmoid_topk_route / scores_for_choice won't raise TypeError/AttributeError
when moe_block is incomplete.
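
The selection logic in the quoted diff, together with the two guards proposed in this comment, can be followed on a single token with the pure-Python sketch below. It is a simplified stand-in rather than the PR's `_sigmoid_topk_route`: it works on lists instead of tensors, reads the bias only from `moe_block`, and the handling of `norm_topk_prob` and `routed_scaling_factor` at the end is an assumption about what those attributes do.

```python
import math
from types import SimpleNamespace


def sigmoid_topk_route(moe_block, logits, top_k):
    """Route one token; `logits` holds one router logit per expert."""
    num_experts = len(logits)
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Guard 1: fail with a clear message instead of adding None to the scores.
    bias = getattr(moe_block, "e_score_correction_bias", None)
    if bias is None:
        raise ValueError("sigmoid routing needs e_score_correction_bias")
    scores = [p + b for p, b in zip(probs, bias)]

    n_group = getattr(moe_block, "n_group", 1)
    if n_group > 1:
        size = num_experts // n_group
        # Score each group by the sum of its two best experts.
        group_scores = [
            sum(sorted(scores[g * size:(g + 1) * size], reverse=True)[:2])
            for g in range(n_group)
        ]
        # Guard 2: read topk_group with a default instead of direct access.
        topk_group = getattr(moe_block, "topk_group", 1)
        ranked = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)
        keep = ranked[:topk_group]
        # Zero the scores of experts outside the kept groups.
        scores = [s if i // size in keep else 0.0 for i, s in enumerate(scores)]

    chosen = sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:top_k]
    weights = [probs[i] for i in chosen]  # final weights ignore the bias
    if getattr(moe_block, "norm_topk_prob", True):  # assumed semantics
        total = sum(weights)
        weights = [w / total for w in weights]
    scale = getattr(moe_block, "routed_scaling_factor", 1.0)  # assumed semantics
    return [w * scale for w in weights], chosen


# Four experts in two groups of two; only the best group survives.
block = SimpleNamespace(e_score_correction_bias=[0.0] * 4, n_group=2, topk_group=1)
weights, experts = sigmoid_topk_route(block, [2.0, 1.0, 3.0, -5.0], top_k=2)
```

In this toy input, expert 2 has the highest individual score, but its group loses to the group holding experts 0 and 1, so experts 0 and 1 are selected.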

codecov · 2026-03-07T05:45:31Z

Codecov Report

❌ Patch coverage is 88.81579% with 17 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../axolotl/integrations/kernels/autotune_callback.py | 75.51% | 12 Missing ⚠️ |
| ...ntegrations/kernels/libs/scattermoe_lora/layers.py | 94.64% | 3 Missing ⚠️ |
| ...axolotl/integrations/kernels/autotune_collector.py | 95.12% | 2 Missing ⚠️ |
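
The percentages in this report can be turned back into approximate changed-line counts, which helps when judging how much each file matters. The helper below is only back-of-the-envelope arithmetic on the numbers shown in the report, under the assumption that patch coverage is covered lines divided by changed lines; it is not a Codecov API.

```python
def implied_total_lines(missing_lines, patch_percent):
    """Estimate the changed-line count from missing lines and patch coverage."""
    return round(missing_lines / (1.0 - patch_percent / 100.0))


# (missing lines, patch %) as listed in the report, keyed by file name.
report_rows = {
    "autotune_callback.py": (12, 75.51),
    "layers.py": (3, 94.64),
    "autotune_collector.py": (2, 95.12),
}

per_file_totals = {
    name: implied_total_lines(missing, percent)
    for name, (missing, percent) in report_rows.items()
}
overall_total = implied_total_lines(17, 88.81579)
listed_total = sum(per_file_totals.values())
```

Under that assumption, the overall figure implies more changed lines than the three listed files account for, which would be consistent with the remainder sitting in fully covered files that the table omits.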

consolidate behavioud of routing in scattermoe kernels

0b1e02a

winglian requested a review from NanoCode012 March 7, 2026 05:29

coderabbitai Bot reviewed Mar 7, 2026

winglian added 4 commits March 7, 2026 02:21

collect telemetry on best chosen autotuned kernel

cd57457

properly collect data

ce0df21

Fix property name and get smem too

58297bb

handle issues raised by coderabbit

e6cd3d2

NanoCode012 reviewed Mar 9, 2026

Comment thread src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

NanoCode012 reviewed Mar 9, 2026

Comment thread tests/integrations/test_scattermoe_lora.py

add tests for parity before refactoring

982869b

winglian merged commit 8f3fb51 into main Mar 17, 2026
29 of 30 checks passed

winglian deleted the scattermoe-route-opts branch March 17, 2026 03:47

winglian merged 6 commits into main from scattermoe-route-opts
