feat: add moe kernel support for non-glu #3558
📝 Walkthrough
Added support for the "nemotron_h" MoE model type with GLU/non-GLU expert architecture flexibility in ScatterMoE and SonicMoE kernel integrations, including conditional weight converter registration and latent projection layer support.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py`:
- Around line 494-500: _sigmoid_topk_route currently only falls back to
base_gate for top_k, causing Nemotron-H routers to miss other router-side
fields; update _sigmoid_topk_route to also fallback to base_gate for the router
attributes n_group, topk_group, norm_topk_prob, and routed_scaling_factor when
the detected expert is non-GLU (match the behavior of NemotronHTopkRouter), so
grouped selection and scaling use the same defaults as the base_gate; check for
these attributes on the router object and assign from base_gate when missing,
keeping existing logic for top_k fallback intact.
In `@src/axolotl/integrations/kernels/sonicmoe/routing.py`:
- Around line 60-62: sigmoid_topk_routing currently accesses moe_block.top_k and
moe_block.topk_group directly which can be missing; change those accesses to use
getattr(moe_block, "top_k", getattr(gate, "top_k", <default>)) and
getattr(moe_block, "topk_group", getattr(gate, "topk_group", <default>)) so they
follow the existing fallback pattern used for e_score_correction_bias and other
optional params; update the references in sigmoid_topk_routing to use these
getattr calls (and pick the same sensible defaults used elsewhere in the
function) so missing attributes on moe_block fall back to gate before
defaulting.
📒 Files selected for processing (4)
- src/axolotl/integrations/kernels/constants.py
- src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
- src/axolotl/integrations/kernels/sonicmoe/patch.py
- src/axolotl/integrations/kernels/sonicmoe/routing.py
# ====================================================================
# Detect GLU vs non-GLU expert architecture
# ====================================================================
# GLU models (Qwen, Mixtral, etc.): gate_up_proj [E, 2*I, H]
# Non-GLU models (Nemotron-H, etc.): up_proj [E, I, H]
has_glu = hasattr(experts, "gate_up_proj")
up_proj_name = "gate_up_proj" if has_glu else "up_proj"
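To make the shape comments above concrete, here is a minimal pure-Python sketch of how the detected layout changes what happens between the up projection and the down projection for a single token. The helper names `detect_expert_layout` and `expert_hidden` are hypothetical and not part of the PR. The gate-half-then-up-half ordering and the SiLU gate are assumptions for illustration; the squared ReLU mirrors the `ActivationType.RELU_SQ` used for `nemotron_h` elsewhere in this PR.

```python
import math
from types import SimpleNamespace


def detect_expert_layout(experts):
    """Return (has_glu, up_proj_name), mirroring the hasattr check above."""
    has_glu = hasattr(experts, "gate_up_proj")
    return has_glu, "gate_up_proj" if has_glu else "up_proj"


def expert_hidden(up_out, has_glu):
    """Turn one token's up-projection output into the down-projection input.

    GLU:     up_out has length 2*I, assumed to be [gate half | up half],
             and the result is silu(gate) * up (SiLU is an assumption).
    Non-GLU: up_out has length I, and the result is relu(x) ** 2.
    """
    if has_glu:
        half = len(up_out) // 2
        gate, up = up_out[:half], up_out[half:]
        return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
    return [max(x, 0.0) ** 2 for x in up_out]


# Stand-in expert modules: only the attribute names matter for detection.
glu_experts = SimpleNamespace(gate_up_proj=object(), down_proj=object())
plain_experts = SimpleNamespace(up_proj=object(), down_proj=object())

print(detect_expert_layout(glu_experts))
print(detect_expert_layout(plain_experts))
```

Either way the result has length I, so the down projection is the same in both layouts; only the fused weight's leading dimension (2*I versus I) differs.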
🧩 Analysis chain
🏁 Script executed:
find . -name "*.py" -path "*scattermoe*" -type f
Repository: axolotl-ai-cloud/axolotl
Length of output: 1190
🏁 Script executed:
cat -n src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -50
Repository: axolotl-ai-cloud/axolotl
Length of output: 2481
🏁 Script executed:
wc -l src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 135
🏁 Script executed:
sed -n '280,320p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1788
🏁 Script executed:
sed -n '494,510p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1041
🏁 Script executed:
grep -n "_sigmoid_topk_route" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 136
🏁 Script executed:
sed -n '253,330p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 3436
🏁 Script executed:
sed -n '253,250p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 92
🏁 Script executed:
sed -n '253,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 3774
🏁 Script executed:
sed -n '195,215p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 972
🏁 Script executed:
sed -n '330,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 411
🏁 Script executed:
rg -A 10 "def _unwrap_experts_lora" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 530
🌐 Web query:
Nemotron-H NemotronHMoE NemotronHTopkRouter router fields n_group topk_group site:huggingface.co
💡 Result:
Nemotron-H (NemotronH) is a hybrid Mamba-Transformer architecture from NVIDIA that includes Mixture-of-Experts (MoE) layers. The MoE uses a top-k router called NemotronHTopkRouter. Key router fields from NemotronHConfig:
- n_group (int, default 1): Number of groups for expert routing. Experts are partitioned into this many groups.
- topk_group (int, default 1): Number of selected groups for each token. Ensures selected experts are only within topk_group groups, implementing grouped top-k selection to reduce communication or improve efficiency.
- Related: num_experts_per_tok (int, default 2): Top-k value, number of experts routed per token.
- n_routed_experts (int, default 8): Total number of routed experts.
The NemotronHTopkRouter computes routing logits (linear projection + sigmoid), selects top-K experts using grouped strategy (first top topk_group groups, then top num_experts_per_tok experts within), gathers weights, optionally normalizes, and applies scaling. This supports load balancing and group-based selection for hardware efficiency in distributed setups.
Citations:
- 1: https://huggingface.co/docs/transformers/main/model_doc/nemotron_h
- 2: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/commit/40b9aefd60876e9c31e92feffc9455cbb0b06718
- 3: https://huggingface.co/docs/transformers/model_doc/nemotron_h
- 4: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/blob/main/modeling_nemotron_h.py
- 5: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/blame/a0ce4e9abcdbaeecf6481ac8f5fbc3ef7a2d8f98/modeling_nemotron_h.py
- 6: https://huggingface.co/sjug/Nemotron-3-Super-120B-A12B-MLX-4bit/resolve/main/configuration_nemotron_h.py?download=true
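The grouped selection described in the result above can be sketched for a single token in plain Python. This is an illustrative sketch, not the Hugging Face or axolotl implementation: the function name `grouped_sigmoid_topk` is hypothetical, the group score (sum of the two best experts in each group) is an assumption borrowed from similar grouped routers, and the `e_score_correction_bias` term mentioned elsewhere in this review is omitted.

```python
import math


def grouped_sigmoid_topk(logits, top_k, n_group=1, topk_group=1,
                         norm_topk_prob=True, routed_scaling_factor=1.0):
    """Route one token; returns (expert_indices, expert_weights)."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    allowed = set(range(len(scores)))
    if n_group > 1:
        size = len(scores) // n_group
        # Assumption: a group's score is the sum of its two best experts.
        group_scores = [
            sum(sorted(scores[g * size:(g + 1) * size], reverse=True)[:2])
            for g in range(n_group)
        ]
        kept = sorted(range(n_group), key=lambda g: (-group_scores[g], g))[:topk_group]
        # Only experts inside the kept groups remain eligible.
        allowed = {g * size + i for g in kept for i in range(size)}
    ranked = sorted(allowed, key=lambda e: (-scores[e], e))[:top_k]
    weights = [scores[e] for e in ranked]
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    return ranked, [w * routed_scaling_factor for w in weights]


logits = [4.0, 3.0, -5.0, -5.0, 3.5, -5.0, -5.0, -5.0]
print(grouped_sigmoid_topk(logits, top_k=2, n_group=2, topk_group=1)[0])
print(grouped_sigmoid_topk(logits, top_k=2)[0])
```

With two groups of four experts and `topk_group=1`, expert 4 is excluded even though it has the second-highest score, because its group loses the group-level selection. This is why silently defaulting `n_group` to 1 changes which experts a token is routed to.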
🏁 Script executed:
rg -A 5 -B 5 "n_group|topk_group|norm_topk_prob|routed_scaling_factor" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -80
Repository: axolotl-ai-cloud/axolotl
Length of output: 2928
🏁 Script executed:
# Check if there are any tests or references to Nemotron-H in the codebase
rg -i "nemotron" src/
Repository: axolotl-ai-cloud/axolotl
Length of output: 1235
Nemotron-H ScatterMoE routing still misses gate-side router fields.
These expert-side changes enable the non-GLU MLP path, but _sigmoid_topk_route() only falls back to base_gate for top_k. Upstream NemotronHTopkRouter also stores n_group, topk_group, norm_topk_prob, and routed_scaling_factor on the router, so Lines 288-316 will skip grouped selection and use default scaling for Nemotron-H unless those fields also fall back to base_gate.
Suggested fix
 def _sigmoid_topk_route(
     moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
@@
-    num_experts = getattr(moe_block, "n_routed_experts", gate_weight.shape[0])
+    num_experts = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(base_gate, "n_routed_experts", gate_weight.shape[0]),
+    )
@@
-    n_group = getattr(moe_block, "n_group", 1)
+    n_group = getattr(moe_block, "n_group", getattr(base_gate, "n_group", 1))
     if n_group > 1:
@@
-        topk_group = getattr(moe_block, "topk_group", n_group)
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(base_gate, "topk_group", n_group)
+        )
@@
-    if getattr(moe_block, "norm_topk_prob", True):
+    if getattr(
+        moe_block, "norm_topk_prob", getattr(base_gate, "norm_topk_prob", True)
+    ):
         topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
-    routed_scaling_factor = getattr(moe_block, "routed_scaling_factor", 1.0)
+    routed_scaling_factor = getattr(
+        moe_block,
+        "routed_scaling_factor",
+        getattr(base_gate, "routed_scaling_factor", 1.0),
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` around lines
494 - 500, _sigmoid_topk_route currently only falls back to base_gate for top_k,
causing Nemotron-H routers to miss other router-side fields; update
_sigmoid_topk_route to also fallback to base_gate for the router attributes
n_group, topk_group, norm_topk_prob, and routed_scaling_factor when the detected
expert is non-GLU (match the behavior of NemotronHTopkRouter), so grouped
selection and scaling use the same defaults as the base_gate; check for these
attributes on the router object and assign from base_gate when missing, keeping
existing logic for top_k fallback intact.
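The fallback order this finding asks for (MoE block first, then the gate, then a default) can be captured in a small helper. The sketch below is only an illustration of that lookup order: `router_attr` is a hypothetical name, and the `SimpleNamespace` objects stand in for a Nemotron-H-like layout in which the router fields live on the gate rather than on the block.

```python
from types import SimpleNamespace


def router_attr(moe_block, base_gate, name, default):
    """Look up `name` on the MoE block, then on the gate, then use `default`."""
    return getattr(moe_block, name, getattr(base_gate, name, default))


# Hypothetical layout: the block carries no router fields, the gate does.
gate = SimpleNamespace(n_group=4, topk_group=2, norm_topk_prob=False,
                       routed_scaling_factor=2.5)
moe_block = SimpleNamespace(gate=gate)

n_group = router_attr(moe_block, gate, "n_group", 1)
# topk_group defaults to n_group, as in the suggested fix above.
topk_group = router_attr(moe_block, gate, "topk_group", n_group)
print(n_group, topk_group)
```

A value set on the block still wins over the gate, so models that keep these fields on the block behave as before.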
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
    return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
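As a rough illustration of the dispatch this branch adds, the sketch below maps a model type to an expert layout, an activation, and an attribute name. Everything here is a stand-in: `Activation` is not the PR's `ActivationType`, `resolve_expert_config` is a hypothetical name, reading the third element as the router attribute name is an interpretation, and treating every other model type as a SwiGLU model is an assumption made only to keep the example small.

```python
from enum import Enum


class Activation(Enum):
    """Stand-in enum for illustration; not the PR's ActivationType."""
    SWIGLU = "swiglu"
    RELU_SQ = "relu_sq"


# Model types whose experts use up_proj + down_proj only (no gate_proj).
NON_GLU_MODEL_TYPES = ("nemotron_h",)


def resolve_expert_config(model_type):
    """Return (is_glu, activation, router_attr_name) for a model type."""
    if model_type in NON_GLU_MODEL_TYPES:
        return False, Activation.RELU_SQ, "gate"
    # Assumption for this sketch: all other model types are GLU models.
    return True, Activation.SWIGLU, "gate"


print(resolve_expert_config("nemotron_h"))
```

Keeping the non-GLU model types in one tuple means a future non-GLU model only needs a new entry rather than a new branch.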
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the routing.py file
find . -name "routing.py" -type f | head -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 122
🏁 Script executed:
# Also check the sonicmoe directory structure
find . -path "*/sonicmoe*" -type f | head -30
Repository: axolotl-ai-cloud/axolotl
Length of output: 295
🏁 Script executed:
# Let's look at the actual code at lines 60-62
head -70 src/axolotl/integrations/kernels/sonicmoe/routing.py | tail -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 955
🏁 Script executed:
# Let's look at the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | head -250
Repository: axolotl-ai-cloud/axolotl
Length of output: 12192
🏁 Script executed:
# Continue reading the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '250,300p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 1679
🏁 Script executed:
# Let's search for Nemotron-H model files in the codebase
find . -type f -name "*nemotron*" | head -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 115
🏁 Script executed:
# Search for any references to NemotronH in the codebase
rg -i "nemotron" --type py -A 5 -B 5 | head -100
Repository: axolotl-ai-cloud/axolotl
Length of output: 8785
🏁 Script executed:
# Let's check what Nemotron-H MoE structure looks like in the example config
cat examples/nemotron/nemotron-mini-4b-qlora.yaml | head -50
Repository: axolotl-ai-cloud/axolotl
Length of output: 877
🏁 Script executed:
# Now let's search for how sigmoid_topk_routing is currently being used or tested
rg "sigmoid_topk_routing" -A 10 -B 5 --type py
Repository: axolotl-ai-cloud/axolotl
Length of output: 29558
🏁 Script executed:
# Let's check the test helpers to understand how moe_block is structured
cat -n tests/integrations/test_sonicmoe.py | head -200
Repository: axolotl-ai-cloud/axolotl
Length of output: 8887
🏁 Script executed:
# Search for _make_glm_moe_block and _make_minimax_m2_moe_block to understand the block structure
rg "_make_glm_moe_block|_make_minimax_m2_moe_block" -A 20 --type py tests/
Repository: axolotl-ai-cloud/axolotl
Length of output: 9840
🌐 Web query:
Nemotron-H model structure top_k gate router attributes huggingface
💡 Result:
The Nemotron-H models are hybrid Mamba-Transformer architectures without MoE layers, using standard MLP (FFN) layers instead. They do not have a top_k gate router.
Nemotron 3 Nano (30B-A3B), built on Nemotron-H, introduces sparse MoE layers replacing FFNs. It uses a learned MLP router (sigmoid gating) that activates the top 6 out of 128 routable experts per token, plus 2 shared experts, for ~3.6B active parameters per token from 31.6B total.
Hugging Face NemotronHConfig supports MoE via layers_block_type including "moe", with num_experts_per_tok (default 2, top-k routing parameter), n_routed_experts (default 8), topk_group (default 1), and norm_topk_prob for normalizing top-k weights. Specific configs for Nemotron-H models lack MoE params as they use dense MLPs; MoE appears in Nemotron 3 models like Nano.
Citations:
- 1: https://research.nvidia.com/labs/adlr/nemotronh/
- 2: https://arxiv.org/html/2504.03624v1
- 3: https://huggingface.co/docs/transformers/model_doc/nemotron_h
- 4: https://huggingface.co/docs/transformers/main/model_doc/nemotron_h
- 5: https://arxiv.org/html/2512.20848v1
- 6: https://huggingface.co/sjug/Nemotron-3-Super-120B-A12B-MLX-4bit/resolve/main/configuration_nemotron_h.py?download=true
- 7: https://arxiv.org/pdf/2512.20848
- 8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/resolve/refs%2Fpr%2F1/configuration_nemotron_h.py?download=true
- 9: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/blob/refs%2Fpr%2F1/configuration_nemotron_h.py
- 10: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/blame/refs%2Fpr%2F22/configuration_nemotron_h.py
- 11: https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models
🏁 Script executed:
# Also check if there are any local references or imports from the transformers library
rg "NemotronH|nemotron_h" -B 5 -A 10 --type py | grep -i "top_k\|n_group\|router\|gate"
Repository: axolotl-ai-cloud/axolotl
Length of output: 1163
🏁 Script executed:
# Let's check if there are any actual Nemotron-H MoE model files or tests
find . -type f \( -name "*.py" -o -name "*.yaml" \) -exec grep -l "nemotron" {} \;
Repository: axolotl-ai-cloud/axolotl
Length of output: 257
🏁 Script executed:
# Search for where top_k is defined or accessed in the routing code for other models
rg "top_k" -B 2 -A 2 src/axolotl/integrations/kernels/sonicmoe/routing.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1169
🏁 Script executed:
# Let's verify the actual parameter access patterns more carefully
# Check what parameters sigmoid_topk_routing expects and doesn't have fallbacks for
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '217,270p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 2926
🏁 Script executed:
# Check the other routing functions to see their patterns
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '84,145p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 3079
🏁 Script executed:
# Let's check the docstring for sigmoid_topk_routing to see what it documents as expected parameters
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '191,216p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 1425
🏁 Script executed:
# Let's look at the patch.py to see how the routing function is called with the actual moe_block
cat -n src/axolotl/integrations/kernels/sonicmoe/patch.py | head -150
Repository: axolotl-ai-cloud/axolotl
Length of output: 6967
🏁 Script executed:
# Check if there's any model-specific handling or initialization for Nemotron-H
rg "nemotron" -B 10 -A 10 --type py src/
Repository: axolotl-ai-cloud/axolotl
Length of output: 5529
Add fallback logic for top_k and topk_group in sigmoid_topk_routing.
Lines 219 and 247 access moe_block.top_k and moe_block.topk_group directly without fallback, while other optional parameters in the same function (lines 263, 266) use getattr() with defaults. The function docstring marks these as optional, yet the code enforces them on moe_block only. Use getattr() with gate as secondary source to match the pattern established for e_score_correction_bias (lines 229-231):
Suggested fix
 def sigmoid_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     gate = moe_block.gate
     T, H = hidden_states.shape
-    K = moe_block.top_k
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
+            f"but neither has it"
+        )
     E = getattr(
         moe_block,
         "n_routed_experts",
         getattr(gate, "n_routed_experts", gate.weight.shape[0]),
     )
     n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
     # ... (lines 223-246 unchanged) ...
     if n_group > 1:
         # ... (lines 241-245 unchanged) ...
-        group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
-        )[1]
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
📝 Committable suggestion
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
    return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
def sigmoid_topk_routing(
    hidden_states: torch.Tensor, moe_block
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    gate = moe_block.gate
    T, H = hidden_states.shape
    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
    if K is None:
        raise AttributeError(
            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
            f"but neither has it"
        )
    E = getattr(
        moe_block,
        "n_routed_experts",
        getattr(gate, "n_routed_experts", gate.weight.shape[0]),
    )
    n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
    # ... rest of function with change at topk_group access ...
    if n_group > 1:
        # ...
        topk_group = getattr(
            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
        )
        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/axolotl/integrations/kernels/sonicmoe/routing.py` around lines 60 - 62,
sigmoid_topk_routing currently accesses moe_block.top_k and moe_block.topk_group
directly which can be missing; change those accesses to use getattr(moe_block,
"top_k", getattr(gate, "top_k", <default>)) and getattr(moe_block, "topk_group",
getattr(gate, "topk_group", <default>)) so they follow the existing fallback
pattern used for e_score_correction_bias and other optional params; update the
references in sigmoid_topk_routing to use these getattr calls (and pick the same
sensible defaults used elsewhere in the function) so missing attributes on
moe_block fall back to gate before defaulting.
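For an attribute such as `top_k` that has no sensible default, the same block-then-gate lookup can raise instead of defaulting. The sketch below is one possible shape for that check, not the PR's code: `require_router_attr` and `_MISSING` are hypothetical names, and the sentinel is a design choice that differs slightly from the `None` default used in the suggestion above.

```python
from types import SimpleNamespace

# Private sentinel so that an attribute explicitly set to None still counts as present.
_MISSING = object()


def require_router_attr(moe_block, gate, name):
    """Return `name` from the block or the gate; raise if neither defines it."""
    value = getattr(moe_block, name, _MISSING)
    if value is _MISSING:
        value = getattr(gate, name, _MISSING)
    if value is _MISSING:
        raise AttributeError(f"routing requires {name!r} on moe_block or gate")
    return value


# Hypothetical layout: top_k lives only on the gate.
gate = SimpleNamespace(top_k=6)
moe_block = SimpleNamespace(gate=gate)
print(require_router_attr(moe_block, gate, "top_k"))
```

Failing early with a message that names the missing attribute is easier to debug than a bare `AttributeError` raised from inside the routing function.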
tried running on #3508 with scatter working fine 🫡
Did we run any correctness tests/checks for the scattermoe changes?
we have some tests in

Description
This is required for NemotronH and should be useful for any future non-GLU MoE models.
SonicMoE already handles non-GLU experts internally; we only needed to patch our side to pass that through properly.
ScatterMoE had no such handling and required patching to support non-GLU experts.