feat: add moe kernel support for non-glu #3558

Open
NanoCode012 wants to merge 5 commits into axolotl-ai-cloud:main from NanoCode012:feat/non-glu-kernel

Conversation

@NanoCode012

@NanoCode012 NanoCode012 commented Mar 30, 2026

Collaborator

Description

This is required for NemotronH and can be useful for any future models.

SonicMoE already handles non-GLU experts internally; we only needed to patch our side to pass the setting through correctly.
ScatterMoE required patching to handle this.

Motivation and Context

How has this been tested?

AI Usage Disclaimer

Screenshots (if appropriate)

Types of changes

Social Handles (Optional)

Summary by CodeRabbit

  • New Features
    • Added support for the nemotron_h model type with optimized Mixture of Experts routing
    • Extended expert layer handling to support both GLU and non-GLU expert architectures
    • Enhanced MoE integration with optional latent projection layers for improved computation paths

@coderabbitai

coderabbitai Bot commented Mar 30, 2026

Contributor

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a104c73-f480-49a5-a4e1-fcce0a1bff61

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Added support for the "nemotron_h" MoE model type with GLU/non-GLU expert architecture flexibility in ScatterMoE and SonicMoE kernel integrations, including conditional weight converter registration and latent projection layer support.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model Registration: src/axolotl/integrations/kernels/constants.py, src/axolotl/integrations/kernels/sonicmoe/routing.py | Added "nemotron_h" model type mapping to SPARSE_MOE_BLOCK and added corresponding routing configuration with sigmoid_topk_routing and RELU_SQ activation in get_model_moe_config. |
| ScatterMoE LoRA Expert Handling: src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | Refactored to support both GLU and non-GLU expert architectures. Added GLU detection via hasattr(experts, "gate_up_proj"), conditional LoRA weight extraction from either gate_up_proj or up_proj, and updated activation logic to handle multiplicative gating for GLU vs direct activation for non-GLU. Added optional latent projection layers (fc1_latent_proj, fc2_latent_proj). |
| SonicMoE Kernel Patch: src/axolotl/integrations/kernels/sonicmoe/patch.py | Implemented GLU detection and conditional weight converter registration. Added latent projection layers before/after expert routing, and conditional expert weight selection between gate_up_proj (GLU) and up_proj (non-GLU) for SonicMoE kernel inputs. |
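
To make the GLU vs non-GLU distinction above concrete, here is a minimal pure-Python sketch of one expert processing one token. Only the `hasattr(experts, "gate_up_proj")` check, the `gate_up_proj` / `up_proj` / `down_proj` names, and the ReLU² activation for `nemotron_h` come from this PR. The `expert_forward` and `matvec` helpers, the use of SiLU on the GLU path, and the assumption that the first half of the fused projection is the gate are illustrative only and may not match the kernels.

```python
import math
from types import SimpleNamespace


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v):
    return v / (1.0 + math.exp(-v))


def relu_sq(v):
    return max(v, 0.0) ** 2


def expert_forward(experts, expert_idx, x):
    """Hypothetical reference forward for one expert and one token."""
    has_glu = hasattr(experts, "gate_up_proj")
    if has_glu:
        # GLU: one fused [2*I, H] projection, split into gate and up halves
        # (assumed order: gate first, then up).
        fused = matvec(experts.gate_up_proj[expert_idx], x)
        inter = len(fused) // 2
        gate, up = fused[:inter], fused[inter:]
        hidden = [silu(g) * u for g, u in zip(gate, up)]
    else:
        # Non-GLU: a single [I, H] projection followed directly by the activation.
        hidden = [relu_sq(u) for u in matvec(experts.up_proj[expert_idx], x)]
    return matvec(experts.down_proj[expert_idx], hidden)


# Toy non-GLU expert bank with E=1, I=2, H=2 (made-up weights).
non_glu = SimpleNamespace(
    up_proj=[[[1.0, 0.0], [0.0, -1.0]]],
    down_proj=[[[1.0, 1.0], [0.0, 1.0]]],
)
print(expert_forward(non_glu, 0, [3.0, 2.0]))
```

The non-GLU branch needs half as many up-projection rows per expert, which is why the weight selection and the activation step both have to branch on `has_glu`.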

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • winglian
  • SalmanMohammadi
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add moe kernel support for non-glu' accurately describes the main change: adding support for non-GLU MoE kernel configurations across multiple files. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py`:
- Around line 494-500: _sigmoid_topk_route currently only falls back to
base_gate for top_k, causing Nemotron-H routers to miss other router-side
fields; update _sigmoid_topk_route to also fallback to base_gate for the router
attributes n_group, topk_group, norm_topk_prob, and routed_scaling_factor when
the detected expert is non-GLU (match the behavior of NemotronHTopkRouter), so
grouped selection and scaling use the same defaults as the base_gate; check for
these attributes on the router object and assign from base_gate when missing,
keeping existing logic for top_k fallback intact.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py`:
- Around line 60-62: sigmoid_topk_routing currently accesses moe_block.top_k and
moe_block.topk_group directly which can be missing; change those accesses to use
getattr(moe_block, "top_k", getattr(gate, "top_k", <default>)) and
getattr(moe_block, "topk_group", getattr(gate, "topk_group", <default>)) so they
follow the existing fallback pattern used for e_score_correction_bias and other
optional params; update the references in sigmoid_topk_routing to use these
getattr calls (and pick the same sensible defaults used elsewhere in the
function) so missing attributes on moe_block fall back to gate before
defaulting.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0b62098-6468-40b3-a997-5a3912992ad9

📥 Commits

Reviewing files that changed from the base of the PR and between 00dee05 and c8d05de.

📒 Files selected for processing (4)
  • src/axolotl/integrations/kernels/constants.py
  • src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
  • src/axolotl/integrations/kernels/sonicmoe/patch.py
  • src/axolotl/integrations/kernels/sonicmoe/routing.py

Comment on lines +494 to +500
# ====================================================================
# Detect GLU vs non-GLU expert architecture
# ====================================================================
# GLU models (Qwen, Mixtral, etc.): gate_up_proj [E, 2*I, H]
# Non-GLU models (Nemotron-H, etc.): up_proj [E, I, H]
has_glu = hasattr(experts, "gate_up_proj")
up_proj_name = "gate_up_proj" if has_glu else "up_proj"

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "*.py" -path "*scattermoe*" -type f

Repository: axolotl-ai-cloud/axolotl

Length of output: 1190


🏁 Script executed:

cat -n src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 2481


🏁 Script executed:

wc -l src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 135


🏁 Script executed:

sed -n '280,320p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1788


🏁 Script executed:

sed -n '494,510p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1041


🏁 Script executed:

grep -n "_sigmoid_topk_route" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 136


🏁 Script executed:

sed -n '253,330p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3436


🏁 Script executed:

sed -n '253,250p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 92


🏁 Script executed:

sed -n '253,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3774


🏁 Script executed:

sed -n '195,215p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 972


🏁 Script executed:

sed -n '330,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 411


🏁 Script executed:

rg -A 10 "def _unwrap_experts_lora" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 530


🌐 Web query:

Nemotron-H NemotronHMoE NemotronHTopkRouter router fields n_group topk_group site:huggingface.co

💡 Result:

Nemotron-H (NemotronH) is a hybrid Mamba-Transformer architecture from NVIDIA that includes Mixture-of-Experts (MoE) layers. The MoE uses a top-k router called NemotronHTopkRouter. Key router fields from NemotronHConfig:

  • n_group (int, default 1): Number of groups for expert routing. Experts are partitioned into this many groups.
  • topk_group (int, default 1): Number of selected groups for each token. Ensures selected experts are only within topk_group groups, implementing grouped top-k selection to reduce communication or improve efficiency.
  • num_experts_per_tok (int, default 2): Top-k value, number of experts routed per token.
  • n_routed_experts (int, default 8): Total number of routed experts.

The NemotronHTopkRouter computes routing logits (linear projection + sigmoid), selects top-K experts using a grouped strategy (first the top topk_group groups, then the top num_experts_per_tok experts within them), gathers weights, optionally normalizes, and applies scaling. This supports load balancing and group-based selection for hardware efficiency in distributed setups.

Citations:
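
The grouped selection described in the result above can be sketched for a single token in plain Python. This is only an illustration under stated assumptions: the `grouped_sigmoid_topk` function is hypothetical, groups are ranked by their single best expert score (the actual router may score groups differently), and the real implementations operate on batched tensors rather than Python lists.

```python
import math


def grouped_sigmoid_topk(
    logits,
    top_k,
    n_group=1,
    topk_group=1,
    norm_topk_prob=True,
    routed_scaling_factor=1.0,
):
    """Route one token; returns (expert_indices, expert_weights)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    num_experts = len(scores)
    group_size = num_experts // n_group

    allowed = range(num_experts)
    if n_group > 1:
        # Rank groups by their best expert (assumption: max score per group),
        # then only keep experts that live in the top `topk_group` groups.
        group_scores = [
            max(scores[g * group_size:(g + 1) * group_size])
            for g in range(n_group)
        ]
        kept = sorted(
            range(n_group), key=lambda g: group_scores[g], reverse=True
        )[:topk_group]
        allowed = [e for e in range(num_experts) if e // group_size in kept]

    # Pick the top_k highest-scoring experts among the allowed ones.
    chosen = sorted(allowed, key=lambda e: scores[e], reverse=True)[:top_k]
    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    return chosen, [w * routed_scaling_factor for w in weights]


# With two groups and one kept group, expert 0 is excluded even though it
# outscores expert 3, because it sits in the losing group.
print(grouped_sigmoid_topk([2.0, 1.0, 3.0, -1.0], top_k=2, n_group=2, topk_group=1))
```

If `n_group` and `topk_group` silently default to 1, the group filter never runs and selection degrades to plain top-k over all experts, which is the behavior difference the review is concerned about.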


🏁 Script executed:

rg -A 5 -B 5 "n_group|topk_group|norm_topk_prob|routed_scaling_factor" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -80

Repository: axolotl-ai-cloud/axolotl

Length of output: 2928


🏁 Script executed:

# Check if there are any tests or references to Nemotron-H in the codebase
rg -i "nemotron" src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 1235


Nemotron-H ScatterMoE routing still misses gate-side router fields.

These expert-side changes enable the non-GLU MLP path, but _sigmoid_topk_route() only falls back to base_gate for top_k. Upstream NemotronHTopkRouter also stores n_group, topk_group, norm_topk_prob, and routed_scaling_factor on the router, so Lines 288-316 will skip grouped selection and use default scaling for Nemotron-H unless those fields also fall back to base_gate.

Suggested fix
 def _sigmoid_topk_route(
     moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
@@
-    num_experts = getattr(moe_block, "n_routed_experts", gate_weight.shape[0])
+    num_experts = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(base_gate, "n_routed_experts", gate_weight.shape[0]),
+    )
@@
-    n_group = getattr(moe_block, "n_group", 1)
+    n_group = getattr(moe_block, "n_group", getattr(base_gate, "n_group", 1))
     if n_group > 1:
@@
-        topk_group = getattr(moe_block, "topk_group", n_group)
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(base_gate, "topk_group", n_group)
+        )
@@
-    if getattr(moe_block, "norm_topk_prob", True):
+    if getattr(
+        moe_block, "norm_topk_prob", getattr(base_gate, "norm_topk_prob", True)
+    ):
         topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
-    routed_scaling_factor = getattr(moe_block, "routed_scaling_factor", 1.0)
+    routed_scaling_factor = getattr(
+        moe_block,
+        "routed_scaling_factor",
+        getattr(base_gate, "routed_scaling_factor", 1.0),
+    )
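
The nested `getattr` pattern in the diff above can also be expressed as a small lookup helper. The sketch below is hypothetical (`resolve_router_attr` does not exist in axolotl) and uses `SimpleNamespace` stand-ins with made-up values; it only illustrates the "block first, then gate, then default" order that the suggestion asks for.

```python
from types import SimpleNamespace

_MISSING = object()


def resolve_router_attr(moe_block, gate, name, default=_MISSING):
    """Look up `name` on the MoE block, then on the gate, then use `default`."""
    for owner in (moe_block, gate):
        value = getattr(owner, name, _MISSING)
        if value is not _MISSING:
            return value
    if default is _MISSING:
        raise AttributeError(f"{name!r} not found on moe_block or gate")
    return default


# Stand-in for a block whose routing fields live on the router only
# (values are illustrative, not taken from any real config).
gate = SimpleNamespace(top_k=6, n_group=8, topk_group=4)
block = SimpleNamespace(gate=gate)

print(resolve_router_attr(block, gate, "n_group", 1))
print(resolve_router_attr(block, gate, "routed_scaling_factor", 1.0))
```

Omitting `default` makes a missing field an explicit error, which suits a required value such as `top_k`, while optional fields keep their existing defaults.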
Comment on lines +60 to +62
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
    return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"

Contributor

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the routing.py file
find . -name "routing.py" -type f | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 122


🏁 Script executed:

# Also check the sonicmoe directory structure
find . -path "*/sonicmoe*" -type f | head -30

Repository: axolotl-ai-cloud/axolotl

Length of output: 295


🏁 Script executed:

# Let's look at the actual code at lines 60-62
head -70 src/axolotl/integrations/kernels/sonicmoe/routing.py | tail -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 955


🏁 Script executed:

# Let's look at the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | head -250

Repository: axolotl-ai-cloud/axolotl

Length of output: 12192


🏁 Script executed:

# Continue reading the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '250,300p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1679


🏁 Script executed:

# Let's search for Nemotron-H model files in the codebase
find . -type f -name "*nemotron*" | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 115


🏁 Script executed:

# Search for any references to NemotronH in the codebase
rg -i "nemotron" --type py -A 5 -B 5 | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 8785


🏁 Script executed:

# Let's check what Nemotron-H MoE structure looks like in the example config
cat examples/nemotron/nemotron-mini-4b-qlora.yaml | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 877


🏁 Script executed:

# Now let's search for how sigmoid_topk_routing is currently being used or tested
rg "sigmoid_topk_routing" -A 10 -B 5 --type py

Repository: axolotl-ai-cloud/axolotl

Length of output: 29558


🏁 Script executed:

# Let's check the test helpers to understand how moe_block is structured
cat -n tests/integrations/test_sonicmoe.py | head -200

Repository: axolotl-ai-cloud/axolotl

Length of output: 8887


🏁 Script executed:

# Search for _make_glm_moe_block and _make_minimax_m2_moe_block to understand the block structure
rg "_make_glm_moe_block|_make_minimax_m2_moe_block" -A 20 --type py tests/

Repository: axolotl-ai-cloud/axolotl

Length of output: 9840


🌐 Web query:

Nemotron-H model structure top_k gate router attributes huggingface

💡 Result:

The Nemotron-H models are hybrid Mamba-Transformer architectures without MoE layers, using standard MLP (FFN) layers instead. They do not have a top_k gate router. Nemotron 3 Nano (30B-A3B), built on Nemotron-H, introduces sparse MoE layers replacing FFNs. It uses a learned MLP router (sigmoid gating) that activates the top 6 out of 128 routable experts per token, plus 2 shared experts, for ~3.6B active parameters per token from 31.6B total. Hugging Face NemotronHConfig supports MoE via layers_block_type including "moe", with num_experts_per_tok (default 2, top-k routing parameter), n_routed_experts (default 8), topk_group (default 1), and norm_topk_prob for normalizing top-k weights. Specific configs for Nemotron-H models lack MoE params as they use dense MLPs; MoE appears in Nemotron 3 models like Nano.

Citations:


🏁 Script executed:

# Also check if there are any local references or imports from the transformers library
rg "NemotronH|nemotron_h" -B 5 -A 10 --type py | grep -i "top_k\|n_group\|router\|gate"

Repository: axolotl-ai-cloud/axolotl

Length of output: 1163


🏁 Script executed:

# Let's check if there are any actual Nemotron-H MoE model files or tests
find . -type f \( -name "*.py" -o -name "*.yaml" \) -exec grep -l "nemotron" {} \;

Repository: axolotl-ai-cloud/axolotl

Length of output: 257


🏁 Script executed:

# Search for where top_k is defined or accessed in the routing code for other models
rg "top_k" -B 2 -A 2 src/axolotl/integrations/kernels/sonicmoe/routing.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1169


🏁 Script executed:

# Let's verify the actual parameter access patterns more carefully
# Check what parameters sigmoid_topk_routing expects and doesn't have fallbacks for
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '217,270p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 2926


🏁 Script executed:

# Check the other routing functions to see their patterns
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '84,145p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 3079


🏁 Script executed:

# Let's check the docstring for sigmoid_topk_routing to see what it documents as expected parameters
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '191,216p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1425


🏁 Script executed:

# Let's look at the patch.py to see how the routing function is called with the actual moe_block
cat -n src/axolotl/integrations/kernels/sonicmoe/patch.py | head -150

Repository: axolotl-ai-cloud/axolotl

Length of output: 6967


🏁 Script executed:

# Check if there's any model-specific handling or initialization for Nemotron-H
rg "nemotron" -B 10 -A 10 --type py src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 5529


Add fallback logic for top_k and topk_group in sigmoid_topk_routing.

Lines 219 and 247 access moe_block.top_k and moe_block.topk_group directly without fallback, while other optional parameters in the same function (lines 263, 266) use getattr() with defaults. The function docstring marks these as optional, yet the code enforces them on moe_block only. Use getattr() with gate as secondary source to match the pattern established for e_score_correction_bias (lines 229-231):

Suggested fix
 def sigmoid_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     gate = moe_block.gate
     T, H = hidden_states.shape
-    K = moe_block.top_k
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
+            f"but neither has it"
+        )
     E = getattr(
         moe_block,
         "n_routed_experts",
         getattr(gate, "n_routed_experts", gate.weight.shape[0]),
     )
     n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
     
     # ... (lines 223-246 unchanged) ...
     
     if n_group > 1:
         # ... (lines 241-245 unchanged) ...
-        group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
-        )[1]
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
    return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"

def sigmoid_topk_routing(
    hidden_states: torch.Tensor, moe_block
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
    gate = moe_block.gate
    T, H = hidden_states.shape
    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
    if K is None:
        raise AttributeError(
            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
            f"but neither has it"
        )
    E = getattr(
        moe_block,
        "n_routed_experts",
        getattr(gate, "n_routed_experts", gate.weight.shape[0]),
    )
    n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
    # ... rest of function with change at topk_group access ...
    if n_group > 1:
        # ...
        topk_group = getattr(
            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
        )
        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
@ved1beta

Member

Tried running this on #3508 with ScatterMoE; it works fine 🫡

@winglian

Collaborator

Did we run any correctness tests/checks for the scattermoe changes?

@ved1beta

Member

The training runs look similar; I think that should be enough.
image

@codecov

codecov Bot commented Mar 31, 2026


Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/integrations/kernels/sonicmoe/patch.py | 0.00% | 28 Missing ⚠️ |
| ...ntegrations/kernels/libs/scattermoe_lora/layers.py | 0.00% | 21 Missing ⚠️ |
| ...c/axolotl/integrations/kernels/sonicmoe/routing.py | 0.00% | 2 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

@winglian

winglian commented Apr 1, 2026

Collaborator

We have some tests in `tests/integrations/test_scattermoe_lora*.py` and `tests/e2e/integrations/test_scattermoe_lora_kernels.py` that we should extend to cover the new functionality.
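
As a rough idea of what a correctness check for the non-GLU path could look like, the sketch below compares a naive per-token reference against a grouped-by-expert computation that mimics the gather, compute, scatter-add pattern. Everything here is hypothetical and pure Python: in the real test files the second path would be the ScatterMoE kernel, the inputs would be torch tensors, and the tolerances would need to reflect the kernel's dtype.

```python
import math
import random


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def non_glu_expert(up, down, x):
    """up_proj -> ReLU^2 -> down_proj for one token."""
    return matvec(down, [max(u, 0.0) ** 2 for u in matvec(up, x)])


def reference_moe(tokens, routes, up_proj, down_proj):
    """Naive path: loop over tokens, then over each token's experts."""
    outputs = []
    for x, route in zip(tokens, routes):
        acc = [0.0] * len(x)
        for expert, weight in route:
            y = non_glu_expert(up_proj[expert], down_proj[expert], x)
            acc = [a + weight * v for a, v in zip(acc, y)]
        outputs.append(acc)
    return outputs


def grouped_moe(tokens, routes, up_proj, down_proj):
    """Grouped path: gather token slots per expert, compute, scatter-add back."""
    outputs = [[0.0] * len(x) for x in tokens]
    per_expert = {}
    for t, route in enumerate(routes):
        for expert, weight in route:
            per_expert.setdefault(expert, []).append((t, weight))
    for expert, slots in per_expert.items():
        for t, weight in slots:
            y = non_glu_expert(up_proj[expert], down_proj[expert], tokens[t])
            outputs[t] = [o + weight * v for o, v in zip(outputs[t], y)]
    return outputs


def test_non_glu_paths_match():
    rng = random.Random(0)
    E, H, I, T = 4, 3, 5, 6  # experts, hidden size, intermediate size, tokens
    up = [[[rng.uniform(-1, 1) for _ in range(H)] for _ in range(I)] for _ in range(E)]
    down = [[[rng.uniform(-1, 1) for _ in range(I)] for _ in range(H)] for _ in range(E)]
    tokens = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(T)]
    routes = [[(e, 0.5) for e in rng.sample(range(E), 2)] for _ in range(T)]
    ref = reference_moe(tokens, routes, up, down)
    got = grouped_moe(tokens, routes, up, down)
    for r, g in zip(ref, got):
        assert all(
            math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(r, g)
        )
```

A check of this shape compares outputs directly, which is a stronger signal than training curves that merely look similar.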

@NanoCode012 NanoCode012 added the wip label Apr 7, 2026