feat: add moe kernel support for non-glu by NanoCode012 · Pull Request #3558 · axolotl-ai-cloud/axolotl

NanoCode012 · 2026-03-30T10:46:38Z

Description

This is required for NemotronH and can be useful for any future models.

SonicMoE already handles non-GLU internally; we only needed to patch our side to pass it through properly.
ScatterMoE required patching to handle this.

Summary by CodeRabbit

New Features
- Added support for the nemotron_h model type with optimized Mixture of Experts routing
- Extended expert layer handling to support both GLU and non-GLU expert architectures
- Enhanced MoE integration with optional latent projection layers for improved computation paths

coderabbitai · 2026-03-30T10:46:51Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4a104c73-f480-49a5-a4e1-fcce0a1bff61

Walkthrough

Added support for the "nemotron_h" MoE model type with GLU/non-GLU expert architecture flexibility in ScatterMoE and SonicMoE kernel integrations, including conditional weight converter registration and latent projection layer support.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model Registration: `src/axolotl/integrations/kernels/constants.py`, `src/axolotl/integrations/kernels/sonicmoe/routing.py` | Added "nemotron_h" model type mapping to `SPARSE_MOE_BLOCK` and added corresponding routing configuration with sigmoid_topk_routing and RELU_SQ activation in `get_model_moe_config`. |
| ScatterMoE LoRA Expert Handling: `src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` | Refactored to support both GLU and non-GLU expert architectures. Added GLU detection via `hasattr(experts, "gate_up_proj")`, conditional LoRA weight extraction from either `gate_up_proj` or `up_proj`, and updated activation logic to handle multiplicative gating for GLU vs direct activation for non-GLU. Added optional latent projection layers (`fc1_latent_proj`, `fc2_latent_proj`). |
| SonicMoE Kernel Patch: `src/axolotl/integrations/kernels/sonicmoe/patch.py` | Implemented GLU detection and conditional weight converter registration. Added latent projection layers before/after expert routing, and conditional expert weight selection between `gate_up_proj` (GLU) and `up_proj` (non-GLU) for SonicMoE kernel inputs. |
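
To make the GLU versus non-GLU distinction in the summary above concrete, here is a minimal pure-Python sketch for one expert and one token. It is not the ScatterMoE or SonicMoE kernel code. The helper names (`expert_forward`, `matvec`), the `SimpleNamespace` stand-ins, the choice of SiLU as the GLU activation, and the assumption that gate rows come first in the fused weight are all illustrative. Only the `hasattr(..., "gate_up_proj")` check, the fused versus single up-projection layout, and squared ReLU on the non-GLU path come from the PR.

```python
import math
from types import SimpleNamespace


def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    return v / (1.0 + math.exp(-v))


def relu_sq(v):
    return max(v, 0.0) ** 2


def expert_forward(expert, x):
    """Forward one token through one expert (hypothetical helper)."""
    has_glu = hasattr(expert, "gate_up_proj")
    if has_glu:
        # GLU: fused projection with 2*I rows; assumed layout is gate rows first.
        fused = matvec(expert.gate_up_proj, x)
        half = len(fused) // 2
        gate, up = fused[:half], fused[half:]
        # Multiplicative gating: act(gate) * up (SiLU is an assumption here).
        hidden = [silu(g) * u for g, u in zip(gate, up)]
    else:
        # Non-GLU: single projection with I rows, activation applied directly.
        hidden = [relu_sq(u) for u in matvec(expert.up_proj, x)]
    return matvec(expert.down_proj, hidden)


# Toy weights with H = 2 and I = 2 (made-up values).
glu_expert = SimpleNamespace(
    gate_up_proj=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]],
    down_proj=[[1.0, 0.0], [0.0, 1.0]],
)
non_glu_expert = SimpleNamespace(
    up_proj=[[1.0, 0.0], [0.0, -1.0]],
    down_proj=[[1.0, 1.0], [0.0, 1.0]],
)
x = [2.0, 3.0]
glu_out = expert_forward(glu_expert, x)
non_glu_out = expert_forward(non_glu_expert, x)
```

The single `hasattr` branch is what lets one code path serve both expert layouts; everything after the hidden activation (the down projection) is shared.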

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Add support for batched_mm, grouped_mm and scattermoe for MoE models #3377 — Extends ScatterMoE/MoE kernel integration with overlapping modifications to HFScatterMoEGatedMLP handling and model-type routing.
feat: add custom routing support for ernie4_5_moe, and hunyuan_v1_moe #3526 — Registers new MoE model types via SPARSE_MOE_BLOCK and sonicmoe/routing.py, directly parallel to nemotron_h registration in this PR.
add gpu correctness tests for scattermoe-lora #3474 — GPU correctness tests exercise the same ScatterMoE LoRA expert handling code paths modified here (gate_up_proj vs up_proj selection).

Suggested reviewers

winglian
SalmanMohammadi

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat: add moe kernel support for non-glu' accurately describes the main change: adding support for non-GLU MoE kernel configurations across multiple files. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py`:
- Around line 494-500: _sigmoid_topk_route currently only falls back to
base_gate for top_k, causing Nemotron-H routers to miss other router-side
fields; update _sigmoid_topk_route to also fallback to base_gate for the router
attributes n_group, topk_group, norm_topk_prob, and routed_scaling_factor when
the detected expert is non-GLU (match the behavior of NemotronHTopkRouter), so
grouped selection and scaling use the same defaults as the base_gate; check for
these attributes on the router object and assign from base_gate when missing,
keeping existing logic for top_k fallback intact.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py`:
- Around line 60-62: sigmoid_topk_routing currently accesses moe_block.top_k and
moe_block.topk_group directly which can be missing; change those accesses to use
getattr(moe_block, "top_k", getattr(gate, "top_k", <default>)) and
getattr(moe_block, "topk_group", getattr(gate, "topk_group", <default>)) so they
follow the existing fallback pattern used for e_score_correction_bias and other
optional params; update the references in sigmoid_topk_routing to use these
getattr calls (and pick the same sensible defaults used elsewhere in the
function) so missing attributes on moe_block fall back to gate before
defaulting.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0b62098-6468-40b3-a997-5a3912992ad9

📥 Commits

Reviewing files that changed from the base of the PR and between 00dee05 and c8d05de.

📒 Files selected for processing (4)

src/axolotl/integrations/kernels/constants.py
src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
src/axolotl/integrations/kernels/sonicmoe/patch.py
src/axolotl/integrations/kernels/sonicmoe/routing.py

coderabbitai · 2026-03-30T10:59:47Z

+        # ====================================================================
+        # Detect GLU vs non-GLU expert architecture
+        # ====================================================================
+        # GLU models (Qwen, Mixtral, etc.): gate_up_proj [E, 2*I, H]
+        # Non-GLU models (Nemotron-H, etc.): up_proj [E, I, H]
+        has_glu = hasattr(experts, "gate_up_proj")
+        up_proj_name = "gate_up_proj" if has_glu else "up_proj"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "*.py" -path "*scattermoe*" -type f

Repository: axolotl-ai-cloud/axolotl

Length of output: 1190

🏁 Script executed:

cat -n src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 2481

🏁 Script executed:

wc -l src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 135

🏁 Script executed:

sed -n '280,320p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1788

🏁 Script executed:

sed -n '494,510p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1041

🏁 Script executed:

grep -n "_sigmoid_topk_route" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 136

🏁 Script executed:

sed -n '253,330p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3436

🏁 Script executed:

sed -n '253,250p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 92

🏁 Script executed:

sed -n '253,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3774

🏁 Script executed:

sed -n '195,215p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 972

🏁 Script executed:

sed -n '330,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 411

🏁 Script executed:

rg -A 10 "def _unwrap_experts_lora" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 530

🌐 Web query:

Nemotron-H NemotronHMoE NemotronHTopkRouter router fields n_group topk_group site:huggingface.co

💡 Result:

Nemotron-H (NemotronH) is a hybrid Mamba-Transformer architecture from NVIDIA that includes Mixture-of-Experts (MoE) layers. The MoE uses a top-k router called NemotronHTopkRouter. Key router fields from NemotronHConfig:

- n_group (int, default 1): Number of groups for expert routing. Experts are partitioned into this many groups.
- topk_group (int, default 1): Number of selected groups for each token. Ensures selected experts are only within topk_group groups, implementing grouped top-k selection to reduce communication or improve efficiency.
- Related: num_experts_per_tok (int, default 2): Top-k value, number of experts routed per token.
- n_routed_experts (int, default 8): Total number of routed experts.

The NemotronHTopkRouter computes routing logits (linear projection + sigmoid), selects top-K experts using grouped strategy (first top topk_group groups, then top num_experts_per_tok experts within), gathers weights, optionally normalizes, and applies scaling. This supports load balancing and group-based selection for hardware efficiency in distributed setups.

Citations:

1: https://huggingface.co/docs/transformers/main/model_doc/nemotron_h

2: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16/commit/40b9aefd60876e9c31e92feffc9455cbb0b06718

3: https://huggingface.co/docs/transformers/model_doc/nemotron_h

4: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/blob/main/modeling_nemotron_h.py

5: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8/blame/a0ce4e9abcdbaeecf6481ac8f5fbc3ef7a2d8f98/modeling_nemotron_h.py

6: https://huggingface.co/sjug/Nemotron-3-Super-120B-A12B-MLX-4bit/resolve/main/configuration_nemotron_h.py?download=true
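
The grouped selection described in the query result above (sigmoid scores, pick the best `topk_group` groups, then the best experts within them, optionally normalize, then scale) can be sketched for a single token in plain Python. This is an illustration under stated assumptions, not the `NemotronHTopkRouter` or axolotl implementation: the function name `sigmoid_grouped_topk` is hypothetical, scoring a group by the sum of its two best experts is an assumption borrowed from similar routers, and the `e_score_correction_bias` term is left out.

```python
import math


def sigmoid_grouped_topk(
    logits,
    top_k,
    n_group=1,
    topk_group=1,
    norm_topk_prob=True,
    routed_scaling_factor=1.0,
):
    """Route one token; returns (expert_indices, expert_weights)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    num_experts = len(scores)
    allowed = list(range(num_experts))

    if n_group > 1:
        group_size = num_experts // n_group
        group_scores = []
        for g in range(n_group):
            members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
            # Assumption: a group's score is the sum of its two best experts.
            group_scores.append(sum(members[:2]))
        best_groups = sorted(range(n_group), key=lambda g: (-group_scores[g], g))
        best_groups = best_groups[:topk_group]
        # Only experts inside the selected groups remain eligible.
        allowed = [g * group_size + i for g in best_groups for i in range(group_size)]

    # Highest score first; ties broken by expert index for determinism.
    ranked = sorted(allowed, key=lambda e: (-scores[e], e))[:top_k]
    weights = [scores[e] for e in ranked]
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return ranked, weights


# Made-up logits for 8 experts in 4 groups of 2.
logits = [3.0, -3.0, 2.0, 1.9, 1.0, 0.9, -2.0, -2.5]
grouped_idx, grouped_w = sigmoid_grouped_topk(logits, top_k=2, n_group=4, topk_group=1)
flat_idx, flat_w = sigmoid_grouped_topk(logits, top_k=2)
```

With these logits, restricting selection to the single best group picks experts 2 and 3, while ungrouped top-2 picks experts 0 and 2. That difference is why a router that silently falls back to `n_group=1` can route tokens differently from the upstream model.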

🏁 Script executed:

rg -A 5 -B 5 "n_group|topk_group|norm_topk_prob|routed_scaling_factor" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -80

Repository: axolotl-ai-cloud/axolotl

Length of output: 2928

🏁 Script executed:

# Check if there are any tests or references to Nemotron-H in the codebase
rg -i "nemotron" src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 1235

Nemotron-H ScatterMoE routing still misses gate-side router fields.

These expert-side changes enable the non-GLU MLP path, but _sigmoid_topk_route() only falls back to base_gate for top_k. Upstream NemotronHTopkRouter also stores n_group, topk_group, norm_topk_prob, and routed_scaling_factor on the router, so Lines 288-316 will skip grouped selection and use default scaling for Nemotron-H unless those fields also fall back to base_gate.

Suggested fix

 def _sigmoid_topk_route(
     moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
@@
-    num_experts = getattr(moe_block, "n_routed_experts", gate_weight.shape[0])
+    num_experts = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(base_gate, "n_routed_experts", gate_weight.shape[0]),
+    )
@@
-    n_group = getattr(moe_block, "n_group", 1)
+    n_group = getattr(moe_block, "n_group", getattr(base_gate, "n_group", 1))
     if n_group > 1:
@@
-        topk_group = getattr(moe_block, "topk_group", n_group)
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(base_gate, "topk_group", n_group)
+        )
@@
-    if getattr(moe_block, "norm_topk_prob", True):
+    if getattr(
+        moe_block, "norm_topk_prob", getattr(base_gate, "norm_topk_prob", True)
+    ):
         topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
-    routed_scaling_factor = getattr(moe_block, "routed_scaling_factor", 1.0)
+    routed_scaling_factor = getattr(
+        moe_block,
+        "routed_scaling_factor",
+        getattr(base_gate, "routed_scaling_factor", 1.0),
+    )
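
The nested `getattr()` pattern in the suggestion above (block first, then router, then a default) could be factored into a small helper. The sketch below is only an illustration of that lookup order: `resolve_router_attr` is a hypothetical name, the `SimpleNamespace` objects stand in for the real MoE block and router, and the attribute values are made up.

```python
from types import SimpleNamespace

_MISSING = object()


def resolve_router_attr(moe_block, base_gate, name, default):
    """Return `name` from moe_block, else from base_gate, else `default`.

    Hypothetical helper mirroring the nested getattr() calls in the review.
    """
    for source in (moe_block, base_gate):
        value = getattr(source, name, _MISSING)
        if value is not _MISSING:
            return value
    return default


# Stand-ins for a block whose router, not the block, owns the routing fields.
moe_block = SimpleNamespace()
base_gate = SimpleNamespace(n_group=4, topk_group=2, routed_scaling_factor=2.5)

n_group = resolve_router_attr(moe_block, base_gate, "n_group", 1)
topk_group = resolve_router_attr(moe_block, base_gate, "topk_group", n_group)
norm_topk_prob = resolve_router_attr(moe_block, base_gate, "norm_topk_prob", True)
scaling = resolve_router_attr(moe_block, base_gate, "routed_scaling_factor", 1.0)

# A value set on the block itself still takes precedence over the router.
block_wins = resolve_router_attr(SimpleNamespace(n_group=8), base_gate, "n_group", 1)
```

Using a sentinel instead of `None` keeps an attribute that is explicitly set to `None` distinguishable from one that is absent; whether that distinction matters for these fields is not stated in the PR.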

coderabbitai · 2026-03-30T10:59:47Z

+    # Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
+    elif model_type in ("nemotron_h",):
+        return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the routing.py file
find . -name "routing.py" -type f | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 122

🏁 Script executed:

# Also check the sonicmoe directory structure
find . -path "*/sonicmoe*" -type f | head -30

Repository: axolotl-ai-cloud/axolotl

Length of output: 295

🏁 Script executed:

# Let's look at the actual code at lines 60-62
head -70 src/axolotl/integrations/kernels/sonicmoe/routing.py | tail -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 955

🏁 Script executed:

# Let's look at the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | head -250

Repository: axolotl-ai-cloud/axolotl

Length of output: 12192

🏁 Script executed:

# Continue reading the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '250,300p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1679

🏁 Script executed:

# Let's search for Nemotron-H model files in the codebase
find . -type f -name "*nemotron*" | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 115

🏁 Script executed:

# Search for any references to NemotronH in the codebase
rg -i "nemotron" --type py -A 5 -B 5 | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 8785

🏁 Script executed:

# Let's check what Nemotron-H MoE structure looks like in the example config
cat examples/nemotron/nemotron-mini-4b-qlora.yaml | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 877

🏁 Script executed:

# Now let's search for how sigmoid_topk_routing is currently being used or tested
rg "sigmoid_topk_routing" -A 10 -B 5 --type py

Repository: axolotl-ai-cloud/axolotl

Length of output: 29558

🏁 Script executed:

# Let's check the test helpers to understand how moe_block is structured
cat -n tests/integrations/test_sonicmoe.py | head -200

Repository: axolotl-ai-cloud/axolotl

Length of output: 8887

🏁 Script executed:

# Search for _make_glm_moe_block and _make_minimax_m2_moe_block to understand the block structure
rg "_make_glm_moe_block|_make_minimax_m2_moe_block" -A 20 --type py tests/

Repository: axolotl-ai-cloud/axolotl

Length of output: 9840

🌐 Web query:

Nemotron-H model structure top_k gate router attributes huggingface

💡 Result:

The Nemotron-H models are hybrid Mamba-Transformer architectures without MoE layers, using standard MLP (FFN) layers instead. They do not have a top_k gate router. Nemotron 3 Nano (30B-A3B), built on Nemotron-H, introduces sparse MoE layers replacing FFNs. It uses a learned MLP router (sigmoid gating) that activates the top 6 out of 128 routable experts per token, plus 2 shared experts, for ~3.6B active parameters per token from 31.6B total. Hugging Face NemotronHConfig supports MoE via layers_block_type including "moe", with num_experts_per_tok (default 2, top-k routing parameter), n_routed_experts (default 8), topk_group (default 1), and norm_topk_prob for normalizing top-k weights. Specific configs for Nemotron-H models lack MoE params as they use dense MLPs; MoE appears in Nemotron 3 models like Nano.

Citations:

1: https://research.nvidia.com/labs/adlr/nemotronh/

2: https://arxiv.org/html/2504.03624v1

3: https://huggingface.co/docs/transformers/model_doc/nemotron_h

4: https://huggingface.co/docs/transformers/main/model_doc/nemotron_h

5: https://arxiv.org/html/2512.20848v1

6: https://huggingface.co/sjug/Nemotron-3-Super-120B-A12B-MLX-4bit/resolve/main/configuration_nemotron_h.py?download=true

7: https://arxiv.org/pdf/2512.20848

8: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/resolve/refs%2Fpr%2F1/configuration_nemotron_h.py?download=true

9: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/blob/refs%2Fpr%2F1/configuration_nemotron_h.py

10: https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/blame/refs%2Fpr%2F22/configuration_nemotron_h.py

11: https://huggingface.co/blog/nvidia/nemotron-3-nano-efficient-open-intelligent-models

🏁 Script executed:

# Also check if there are any local references or imports from the transformers library
rg "NemotronH|nemotron_h" -B 5 -A 10 --type py | grep -i "top_k\|n_group\|router\|gate"

Repository: axolotl-ai-cloud/axolotl

Length of output: 1163

🏁 Script executed:

# Let's check if there are any actual Nemotron-H MoE model files or tests
find . -type f \( -name "*.py" -o -name "*.yaml" \) -exec grep -l "nemotron" {} \;

Repository: axolotl-ai-cloud/axolotl

Length of output: 257

🏁 Script executed:

# Search for where top_k is defined or accessed in the routing code for other models
rg "top_k" -B 2 -A 2 src/axolotl/integrations/kernels/sonicmoe/routing.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1169

🏁 Script executed:

# Let's verify the actual parameter access patterns more carefully
# Check what parameters sigmoid_topk_routing expects and doesn't have fallbacks for
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '217,270p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 2926

🏁 Script executed:

# Check the other routing functions to see their patterns
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '84,145p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 3079

🏁 Script executed:

# Let's check the docstring for sigmoid_topk_routing to see what it documents as expected parameters
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '191,216p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1425

🏁 Script executed:

# Let's look at the patch.py to see how the routing function is called with the actual moe_block
cat -n src/axolotl/integrations/kernels/sonicmoe/patch.py | head -150

Repository: axolotl-ai-cloud/axolotl

Length of output: 6967

🏁 Script executed:

# Check if there's any model-specific handling or initialization for Nemotron-H
rg "nemotron" -B 10 -A 10 --type py src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 5529

Add fallback logic for top_k and topk_group in sigmoid_topk_routing.

Lines 219 and 247 access moe_block.top_k and moe_block.topk_group directly without fallback, while other optional parameters in the same function (lines 263, 266) use getattr() with defaults. The function docstring marks these as optional, yet the code enforces them on moe_block only. Use getattr() with gate as secondary source to match the pattern established for e_score_correction_bias (lines 229-231):

Suggested fix

 def sigmoid_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     gate = moe_block.gate
     T, H = hidden_states.shape
-    K = moe_block.top_k
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
+            f"but neither has it"
+        )
     E = getattr(
         moe_block,
         "n_routed_experts",
         getattr(gate, "n_routed_experts", gate.weight.shape[0]),
     )
     n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
     # ... (lines 223-246 unchanged) ...
     if n_group > 1:
         # ... (lines 241-245 unchanged) ...
-        group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
-        )[1]
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    # Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
-    elif model_type in ("nemotron_h",):
-        return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
+def sigmoid_topk_routing(
+    hidden_states: torch.Tensor, moe_block
+) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
+    gate = moe_block.gate
+    T, H = hidden_states.shape
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
+            f"but neither has it"
+        )
+    E = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(gate, "n_routed_experts", gate.weight.shape[0]),
+    )
+    n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
+    # ... rest of function with change at topk_group access ...
+    if n_group > 1:
+        # ...
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]

ved1beta · 2026-03-30T13:48:04Z

tried running on #3508 with scatter working fine 🫡

winglian · 2026-03-30T22:12:05Z

Did we run any correctness tests/checks for the scattermoe changes?

ved1beta · 2026-03-31T06:23:13Z

the training runs look similar, i think that should be enough

codecov · 2026-03-31T14:52:56Z

Codecov Report

❌ Patch coverage is 0% with 51 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/axolotl/integrations/kernels/sonicmoe/patch.py | 0.00% | 28 Missing ⚠️ |
| ...ntegrations/kernels/libs/scattermoe_lora/layers.py | 0.00% | 21 Missing ⚠️ |
| ...c/axolotl/integrations/kernels/sonicmoe/routing.py | 0.00% | 2 Missing ⚠️ |

winglian · 2026-04-01T13:17:47Z

we have some tests in `tests/integrations/test_scattermoe_lora*.py` and `tests/e2e/integrations/test_scattermoe_lora_kernels.py` that should include the new functionality

feat: add moe kernel support for non-glu

c8d05de

coderabbitai Bot reviewed Mar 30, 2026

Merge branch 'main' into feat/non-glu-kernel

2df8af8

fix: triton error with hidden states upcasted

3182d00

NanoCode012 added 2 commits April 2, 2026 11:39

feat: add non-glu test

f70b6af

Merge branch 'main' into feat/non-glu-kernel

60215e8

NanoCode012 added the wip label Apr 7, 2026
