4 changes: 4 additions & 0 deletions src/axolotl/integrations/kernels/constants.py
@@ -41,6 +41,10 @@
"glm4v_moe": "Glm4vMoeTextMoE",
# sigmoid -> topk routing (no group selection)
"minimax_m2": "MiniMaxM2SparseMoeBlock",
# Non-GLU MoE (no gate_proj, experts have up_proj + down_proj only)
"nemotron_h": "NemotronHMoE",
# Models below need custom routing (not yet implemented):
# "deepseek_v2": "DeepseekV2Moe", # softmax->topk, group_limited_greedy, different attr names (num_group)
# softmax->topk, e_score_correction_bias between softmax and topk
"ernie4_5_moe": "Ernie4_5_MoeSparseMoeBlock",
# softmax->topk, group_limited_greedy, different attr names (num_group)
62 changes: 48 additions & 14 deletions src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
@@ -196,12 +196,14 @@ def _unwrap_experts_lora(experts_module):
if num_experts is None:
# Fallback: infer from parameter shape
gup = getattr(base_experts, "gate_up_proj", None)
if gup is None:
gup = getattr(base_experts, "up_proj", None)
if gup is not None:
num_experts = gup.shape[0]

# Extract gate_up_proj LoRA (needs A<->B swap due to transposition)
# Extract gate_up_proj (or up_proj for non-GLU) LoRA
gup_lora = None
gup_wrapper = wrappers.get("gate_up_proj")
gup_wrapper = wrappers.get("gate_up_proj") or wrappers.get("up_proj")
if gup_wrapper is not None:
lora_A, lora_B, scaling = get_lora_params_from_wrapper(gup_wrapper)
if lora_A is not None:
@@ -489,6 +491,21 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
# ====================================================================
experts, gup_lora, down_lora = _unwrap_experts_lora(self.experts)

# ====================================================================
# Detect GLU vs non-GLU expert architecture
# ====================================================================
# GLU models (Qwen, Mixtral, etc.): gate_up_proj [E, 2*I, H]
# Non-GLU models (Nemotron-H, etc.): up_proj [E, I, H]
has_glu = hasattr(experts, "gate_up_proj")
up_proj_name = "gate_up_proj" if has_glu else "up_proj"
Comment on lines +494 to +500

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "*.py" -path "*scattermoe*" -type f

Repository: axolotl-ai-cloud/axolotl

Length of output: 1190


🏁 Script executed:

cat -n src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 2481


🏁 Script executed:

wc -l src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 135


🏁 Script executed:

sed -n '280,320p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1788


🏁 Script executed:

sed -n '494,510p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1041


🏁 Script executed:

grep -n "_sigmoid_topk_route" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 136


🏁 Script executed:

sed -n '253,330p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3436


🏁 Script executed:

sed -n '253,250p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 92


🏁 Script executed:

sed -n '253,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 3774


🏁 Script executed:

sed -n '195,215p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 972


🏁 Script executed:

sed -n '330,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 411


🏁 Script executed:

rg -A 10 "def _unwrap_experts_lora" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 530


🌐 Web query:

Nemotron-H NemotronHMoE NemotronHTopkRouter router fields n_group topk_group site:huggingface.co

💡 Result:

Nemotron-H (NemotronH) is a hybrid Mamba-Transformer architecture from NVIDIA that includes Mixture-of-Experts (MoE) layers. The MoE uses a top-k router called NemotronHTopkRouter. Key router fields from NemotronHConfig: - n_group (int, default 1): Number of groups for expert routing. Experts are partitioned into this many groups. - topk_group (int, default 1): Number of selected groups for each token. Ensures selected experts are only within topk_group groups, implementing grouped top-k selection to reduce communication or improve efficiency. - Related: num_experts_per_tok (int, default 2): Top-k value, number of experts routed per token. - n_routed_experts (int, default 8): Total number of routed experts. The NemotronHTopkRouter computes routing logits (linear projection + sigmoid), selects top-K experts using grouped strategy (first top topk_group groups, then top num_experts_per_tok experts within), gathers weights, optionally normalizes, and applies scaling. This supports load balancing and group-based selection for hardware efficiency in distributed setups.
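
The grouped selection described in this result can be sketched for a single token in plain Python. This is an illustrative sketch only, not the Transformers or Axolotl implementation: the function name `grouped_sigmoid_topk` is hypothetical, the parameter names simply mirror the config fields mentioned above, and scoring a group by its best expert is an assumption (real routers may score groups differently).

```python
import math


def grouped_sigmoid_topk(logits, n_group=1, topk_group=1, top_k=2,
                         norm_topk_prob=True, routed_scaling_factor=1.0):
    """Route one token; returns (expert_indices, expert_weights)."""
    num_experts = len(logits)
    if num_experts % n_group != 0:
        raise ValueError("num_experts must be divisible by n_group")
    # Sigmoid scores (each expert scored independently, unlike softmax).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Step 1: restrict candidates to the best `topk_group` groups.
    group_size = num_experts // n_group
    allowed = set(range(num_experts))
    if n_group > 1:
        # Assumption: a group's score is the score of its best expert.
        group_scores = [
            max(scores[g * group_size:(g + 1) * group_size])
            for g in range(n_group)
        ]
        best_groups = sorted(
            range(n_group), key=lambda g: group_scores[g], reverse=True
        )[:topk_group]
        allowed = {
            e
            for g in best_groups
            for e in range(g * group_size, (g + 1) * group_size)
        }

    # Step 2: pick the top-k experts among the allowed candidates.
    chosen = sorted(allowed, key=lambda e: scores[e], reverse=True)[:top_k]
    weights = [scores[e] for e in chosen]

    # Step 3: optional renormalization, then scaling.
    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# 4 experts in 2 groups; group 1 (experts 2 and 3) holds the best expert.
idx, w = grouped_sigmoid_topk(
    [1.0, 0.9, 2.0, -3.0], n_group=2, topk_group=1, top_k=2
)
print(idx, w)
```

With `n_group=1` the same logits select experts 2 and 0 instead of 2 and 3, which is the behaviour a router silently falls back to when the grouping fields are not found.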



🏁 Script executed:

rg -A 5 -B 5 "n_group|topk_group|norm_topk_prob|routed_scaling_factor" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -80

Repository: axolotl-ai-cloud/axolotl

Length of output: 2928


🏁 Script executed:

# Check if there are any tests or references to Nemotron-H in the codebase
rg -i "nemotron" src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 1235


Nemotron-H ScatterMoE routing still misses gate-side router fields.

These expert-side changes enable the non-GLU MLP path, but _sigmoid_topk_route() only falls back to base_gate for top_k. Upstream NemotronHTopkRouter also stores n_group, topk_group, norm_topk_prob, and routed_scaling_factor on the router, so Lines 288-316 will skip grouped selection and use default scaling for Nemotron-H unless those fields also fall back to base_gate.

Suggested fix
 def _sigmoid_topk_route(
     moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
@@
-    num_experts = getattr(moe_block, "n_routed_experts", gate_weight.shape[0])
+    num_experts = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(base_gate, "n_routed_experts", gate_weight.shape[0]),
+    )
@@
-    n_group = getattr(moe_block, "n_group", 1)
+    n_group = getattr(moe_block, "n_group", getattr(base_gate, "n_group", 1))
     if n_group > 1:
@@
-        topk_group = getattr(moe_block, "topk_group", n_group)
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(base_gate, "topk_group", n_group)
+        )
@@
-    if getattr(moe_block, "norm_topk_prob", True):
+    if getattr(
+        moe_block, "norm_topk_prob", getattr(base_gate, "norm_topk_prob", True)
+    ):
         topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
-    routed_scaling_factor = getattr(moe_block, "routed_scaling_factor", 1.0)
+    routed_scaling_factor = getattr(
+        moe_block,
+        "routed_scaling_factor",
+        getattr(base_gate, "routed_scaling_factor", 1.0),
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py` around lines
494 - 500, _sigmoid_topk_route currently only falls back to base_gate for top_k,
causing Nemotron-H routers to miss other router-side fields; update
_sigmoid_topk_route to also fallback to base_gate for the router attributes
n_group, topk_group, norm_topk_prob, and routed_scaling_factor when the detected
expert is non-GLU (match the behavior of NemotronHTopkRouter), so grouped
selection and scaling use the same defaults as the base_gate; check for these
attributes on the router object and assign from base_gate when missing, keeping
existing logic for top_k fallback intact.
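
The failure mode described in this comment is silent: a single-level `getattr` with a default never raises, it just returns the default. A minimal sketch with stand-in objects shows the difference. The `SimpleNamespace` stubs, their values, and the helper name `router_attr` are hypothetical and are not Axolotl or Transformers code.

```python
from types import SimpleNamespace


def router_attr(moe_block, base_gate, name, default):
    """Look up `name` on the MoE block first, then on the router, then default."""
    return getattr(moe_block, name, getattr(base_gate, name, default))


# Stand-in for a block whose routing fields live on the router, not the block.
gate = SimpleNamespace(top_k=6, n_group=4, topk_group=2, routed_scaling_factor=2.5)
block = SimpleNamespace(gate=gate)

# Single-level lookup: silently returns the default, so grouping is skipped.
print(getattr(block, "n_group", 1))            # 1
# Two-level lookup: finds the value stored on the router.
print(router_attr(block, gate, "n_group", 1))  # 4
```

Fields that neither object defines still resolve to the caller's default, so blocks that already store these attributes on themselves behave exactly as before.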


# ====================================================================
# Optional latent projection before experts (e.g. Nemotron-H)
# ====================================================================
fc1_latent = getattr(self, "fc1_latent_proj", None)
if fc1_latent is not None:
hidden_states_flat = fc1_latent(hidden_states_flat)

# ====================================================================
# Selective expert weight dequantization
# ====================================================================
@@ -498,7 +515,7 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
use_selective = (
getattr(self, "_use_selective_dequant", False)
and hasattr(experts, "parametrizations")
and "gate_up_proj" in experts.parametrizations
and up_proj_name in experts.parametrizations
)

if use_selective:
@@ -517,11 +534,11 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
num_experts,
)
# Dequantize only active experts' weights
gate_up_W = selective_expert_weights(
up_W = selective_expert_weights(
experts,
"gate_up_proj",
up_proj_name,
active_experts,
).transpose(2, 1) # [num_active, hidden, 2*inter]
).transpose(2, 1)

# Remap LoRA weights to match compact expert indices
if gup_lora is not None:
@@ -538,18 +555,18 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
sei_gup = remapped_expert_idxs
eo_gup = compact_offsets
else:
gate_up_W = experts.gate_up_proj.transpose(2, 1) # [E, hidden, 2*inter]
up_W = getattr(experts, up_proj_name).transpose(2, 1)
sei_gup = sorted_expert_idxs
eo_gup = expert_offsets

# ====================================================================
# Gate + Up projection
# Up projection (GLU: gate+up fused, non-GLU: up only)
# ====================================================================
if gup_lora is not None:
gup_A, gup_B, gup_scaling = gup_lora
gup = parallel_linear_lora(
up_out = parallel_linear_lora(
hidden_states_flat,
gate_up_W,
up_W,
top_k,
sei_gup,
sorted_scattered_idxs,
@@ -563,9 +580,9 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
use_fused_gather=True,
)
else:
gup = parallel_linear(
up_out = parallel_linear(
hidden_states_flat,
gate_up_W,
up_W,
top_k,
sei_gup,
sorted_scattered_idxs,
@@ -574,8 +591,18 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
grouped_out=True,
)

gates, h = gup.chunk(2, dim=-1)
h = experts.act_fn(gates) * h
# GLU: split into gate and up, apply act_fn(gate) * up
# Non-GLU: apply act_fn directly
if has_glu:
gates, h = up_out.chunk(2, dim=-1)
h = experts.act_fn(gates) * h
else:
h = experts.act_fn(up_out)

# Some activations (e.g. relu2) upcast to fp32 internally.
# Cast back to weight dtype for the down projection Triton kernel.
if h.dtype != experts.down_proj.dtype:
h = h.to(experts.down_proj.dtype)

# ====================================================================
# Down projection
@@ -635,6 +662,13 @@ def forward(self: nn.Module, layer_input: torch.Tensor):
gates=routing_weights,
)

# ====================================================================
# Optional latent projection after experts (e.g. Nemotron-H)
# ====================================================================
fc2_latent = getattr(self, "fc2_latent_proj", None)
if fc2_latent is not None:
expert_output = fc2_latent(expert_output)

# ====================================================================
# Combine with shared expert and reshape
# ====================================================================
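
The two expert paths in the hunks above differ only in how the up-projection output becomes the hidden activation. A per-token sketch in plain Python, for illustration only: `silu` and `relu2` stand in for whatever `experts.act_fn` is in a given model, and the function name `expert_hidden` is hypothetical.

```python
import math


def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return max(x, 0.0) ** 2


def expert_hidden(up_out, has_glu, act_fn):
    """Turn one token's up-projection output into the hidden activation.

    GLU: `up_out` has length 2*I and is split into [gate | up] halves.
    Non-GLU: `up_out` has length I and the activation is applied directly.
    """
    if has_glu:
        half = len(up_out) // 2
        gates, up = up_out[:half], up_out[half:]
        return [act_fn(g) * u for g, u in zip(gates, up)]
    return [act_fn(x) for x in up_out]


print(expert_hidden([0.0, 1.0, 3.0, 4.0], has_glu=True, act_fn=silu))
print(expert_hidden([-1.0, 2.0], has_glu=False, act_fn=relu2))  # [0.0, 4.0]
```

Both paths return a vector of length `I`, so the down projection sees the same input width either way.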
80 changes: 60 additions & 20 deletions src/axolotl/integrations/kernels/sonicmoe/patch.py
@@ -39,6 +39,8 @@ def patch_sonicmoe(model_type: str, torch_compile: bool = False):
torch_compile: If True, wrap routing functions with torch.compile
for kernel fusion (fuses softmax+topk+renorm into fewer launches).
"""
from sonicmoe.enums import is_glu

from .routing import get_model_moe_config
from .weight_converter import register_sonicmoe_weight_converter

@@ -49,7 +51,11 @@ def patch_sonicmoe(model_type: str, torch_compile: bool = False):

for moe_cls in resolve_moe_block_classes(model_type):
_patch_forward(moe_cls, routing_fn, activation, router_attr)
register_sonicmoe_weight_converter(model_type)

# Weight interleaving only applies to GLU models (gate_up_proj).
# Non-GLU models have a plain up_proj that needs no conversion.
if is_glu(activation):
register_sonicmoe_weight_converter(model_type)


def _try_compile_routing(routing_fn):
@@ -98,43 +104,60 @@ def _patch_forward(moe_cls, routing_fn, activation, router_attr):

def _make_general_forward(moe_cls, routing_fn, activation):
"""Create forward using routing_fn + moe_general_routing_inputs."""
from sonicmoe.enums import is_glu

glu_activation = is_glu(activation)

def sonicmoe_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
from sonicmoe import moe_general_routing_inputs

batch_size, sequence_length, hidden_dim = hidden_states.shape
hidden_states_flat = hidden_states.view(-1, hidden_dim)

# Shared expert (computed early, matching original model ordering)
# Shared expert
shared_expert_output = _compute_shared_expert(self, hidden_states_flat)

# Routing
router_scores, token_indices, expert_indices, _router_logits = routing_fn(
hidden_states_flat, self
)

# Permute weights to SonicMoE layout:
# gate_up: [E, 2*I, H] -> [2*I, H, E]
# down: [E, H, I] -> [H, I, E]
gate_up_weight = self.experts.gate_up_proj.permute(1, 2, 0)
# Optional latent projection before experts (e.g. Nemotron-H)
expert_input = hidden_states_flat
fc1_latent = getattr(self, "fc1_latent_proj", None)
if fc1_latent is not None:
expert_input = fc1_latent(expert_input)

# Permute weights to SonicMoE layout.
# GLU models: gate_up_proj [E, 2*I, H] -> [2*I, H, E]
# Non-GLU: up_proj [E, I, H] -> [I, H, E]
if glu_activation:
up_weight = self.experts.gate_up_proj.permute(1, 2, 0)
else:
up_weight = self.experts.up_proj.permute(1, 2, 0)
down_weight = self.experts.down_proj.permute(1, 2, 0)
E = gate_up_weight.shape[-1]
E = up_weight.shape[-1]

output, _ = moe_general_routing_inputs(
hidden_states_flat,
expert_input,
router_scores,
token_indices,
expert_indices,
gate_up_weight,
None, # b1 (no gate/up bias)
up_weight,
None, # b1 (no bias)
down_weight,
None, # b2 (no down bias)
None, # b2 (no bias)
E,
torch.cuda.current_stream().cuda_stream,
activation,
False, # is_inference_mode
)

# Optional latent projection after experts (e.g. Nemotron-H)
fc2_latent = getattr(self, "fc2_latent_proj", None)
if fc2_latent is not None:
output = fc2_latent(output)

# Add shared expert contribution if present
if shared_expert_output is not None:
if hasattr(self, "shared_expert_gate"):
@@ -151,37 +174,54 @@ def sonicmoe_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:

def _make_fused_forward(moe_cls, activation, router_attr):
"""Create forward using moe_TC_softmax_topk_layer (topk -> softmax)."""
from sonicmoe.enums import is_glu

glu_activation = is_glu(activation)

def sonicmoe_fused_forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
from sonicmoe import moe_TC_softmax_topk_layer

batch_size, sequence_length, hidden_dim = hidden_states.shape
hidden_states_flat = hidden_states.view(-1, hidden_dim)

# Shared expert (computed early, matching original model ordering)
# Shared expert
shared_expert_output = _compute_shared_expert(self, hidden_states_flat)

router = getattr(self, router_attr)

# Permute weights to SonicMoE layout:
# gate_up: [E, 2*I, H] -> [2*I, H, E]
# down: [E, H, I] -> [H, I, E]
gate_up_weight = self.experts.gate_up_proj.permute(1, 2, 0)
# Optional latent projection before experts (e.g. Nemotron-H)
expert_input = hidden_states_flat
fc1_latent = getattr(self, "fc1_latent_proj", None)
if fc1_latent is not None:
expert_input = fc1_latent(expert_input)

# Permute weights to SonicMoE layout.
# GLU models: gate_up_proj [E, 2*I, H] -> [2*I, H, E]
# Non-GLU: up_proj [E, I, H] -> [I, H, E]
if glu_activation:
up_weight = self.experts.gate_up_proj.permute(1, 2, 0)
else:
up_weight = self.experts.up_proj.permute(1, 2, 0)
down_weight = self.experts.down_proj.permute(1, 2, 0)

output, _router_logits, _expert_freq = moe_TC_softmax_topk_layer(
hidden_states_flat,
expert_input,
router.weight,
gate_up_weight,
None, # b1 (no gate/up bias)
up_weight,
None, # b1 (no bias)
down_weight,
None, # b2 (no down bias)
None, # b2 (no bias)
router.top_k,
torch.cuda.current_stream().cuda_stream,
activation,
False, # is_inference_mode
)

# Optional latent projection after experts (e.g. Nemotron-H)
fc2_latent = getattr(self, "fc2_latent_proj", None)
if fc2_latent is not None:
output = fc2_latent(output)

# Add shared expert contribution if present
if shared_expert_output is not None:
if hasattr(self, "shared_expert_gate"):
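
The layout comments in both forwards describe the same axis permutation, `permute(1, 2, 0)`, applied to differently shaped weights. A shape-only sketch makes the resulting layouts explicit. No tensors are involved, `permuted_shape` is a hypothetical helper, and the sizes are made up for illustration.

```python
def permuted_shape(shape, dims):
    """Shape produced by tensor.permute(*dims) on a tensor of the given shape."""
    if sorted(dims) != list(range(len(shape))):
        raise ValueError("dims must be a permutation of the axes")
    return tuple(shape[d] for d in dims)


E, I, H = 8, 16, 32  # experts, intermediate size, hidden size (illustrative)

glu_up = permuted_shape((E, 2 * I, H), (1, 2, 0))  # gate_up_proj (GLU)
plain_up = permuted_shape((E, I, H), (1, 2, 0))    # up_proj (non-GLU)
down = permuted_shape((E, H, I), (1, 2, 0))        # down_proj

print(glu_up, plain_up, down)  # (32, 32, 8) (16, 32, 8) (32, 16, 8)
```

In every case the expert axis ends up last, which is why `E = up_weight.shape[-1]` works for both the GLU and non-GLU layouts.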
7 changes: 7 additions & 0 deletions src/axolotl/integrations/kernels/sonicmoe/routing.py
@@ -59,6 +59,13 @@ def get_model_moe_config(model_type: str):
"minimax_m2",
):
return sigmoid_topk_routing, ActivationType.SWIGLU, "gate"
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
Comment on lines +62 to +64


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the routing.py file
find . -name "routing.py" -type f | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 122


🏁 Script executed:

# Also check the sonicmoe directory structure
find . -path "*/sonicmoe*" -type f | head -30

Repository: axolotl-ai-cloud/axolotl

Length of output: 295


🏁 Script executed:

# Let's look at the actual code at lines 60-62
head -70 src/axolotl/integrations/kernels/sonicmoe/routing.py | tail -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 955


🏁 Script executed:

# Let's look at the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | head -250

Repository: axolotl-ai-cloud/axolotl

Length of output: 12192


🏁 Script executed:

# Continue reading the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '250,300p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1679


🏁 Script executed:

# Let's search for Nemotron-H model files in the codebase
find . -type f -name "*nemotron*" | head -20

Repository: axolotl-ai-cloud/axolotl

Length of output: 115


🏁 Script executed:

# Search for any references to NemotronH in the codebase
rg -i "nemotron" --type py -A 5 -B 5 | head -100

Repository: axolotl-ai-cloud/axolotl

Length of output: 8785


🏁 Script executed:

# Let's check what Nemotron-H MoE structure looks like in the example config
cat examples/nemotron/nemotron-mini-4b-qlora.yaml | head -50

Repository: axolotl-ai-cloud/axolotl

Length of output: 877


🏁 Script executed:

# Now let's search for how sigmoid_topk_routing is currently being used or tested
rg "sigmoid_topk_routing" -A 10 -B 5 --type py

Repository: axolotl-ai-cloud/axolotl

Length of output: 29558


🏁 Script executed:

# Let's check the test helpers to understand how moe_block is structured
cat -n tests/integrations/test_sonicmoe.py | head -200

Repository: axolotl-ai-cloud/axolotl

Length of output: 8887


🏁 Script executed:

# Search for _make_glm_moe_block and _make_minimax_m2_moe_block to understand the block structure
rg "_make_glm_moe_block|_make_minimax_m2_moe_block" -A 20 --type py tests/

Repository: axolotl-ai-cloud/axolotl

Length of output: 9840


🌐 Web query:

Nemotron-H model structure top_k gate router attributes huggingface

💡 Result:

The Nemotron-H models are hybrid Mamba-Transformer architectures without MoE layers, using standard MLP (FFN) layers instead. They do not have a top_k gate router. Nemotron 3 Nano (30B-A3B), built on Nemotron-H, introduces sparse MoE layers replacing FFNs. It uses a learned MLP router (sigmoid gating) that activates the top 6 out of 128 routable experts per token, plus 2 shared experts, for ~3.6B active parameters per token from 31.6B total. Hugging Face NemotronHConfig supports MoE via layers_block_type including "moe", with num_experts_per_tok (default 2, top-k routing parameter), n_routed_experts (default 8), topk_group (default 1), and norm_topk_prob for normalizing top-k weights. Specific configs for Nemotron-H models lack MoE params as they use dense MLPs; MoE appears in Nemotron 3 models like Nano.



🏁 Script executed:

# Also check if there are any local references or imports from the transformers library
rg "NemotronH|nemotron_h" -B 5 -A 10 --type py | grep -i "top_k\|n_group\|router\|gate"

Repository: axolotl-ai-cloud/axolotl

Length of output: 1163


🏁 Script executed:

# Let's check if there are any actual Nemotron-H MoE model files or tests
find . -type f \( -name "*.py" -o -name "*.yaml" \) -exec grep -l "nemotron" {} \;

Repository: axolotl-ai-cloud/axolotl

Length of output: 257


🏁 Script executed:

# Search for where top_k is defined or accessed in the routing code for other models
rg "top_k" -B 2 -A 2 src/axolotl/integrations/kernels/sonicmoe/routing.py

Repository: axolotl-ai-cloud/axolotl

Length of output: 1169


🏁 Script executed:

# Let's verify the actual parameter access patterns more carefully
# Check what parameters sigmoid_topk_routing expects and doesn't have fallbacks for
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '217,270p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 2926


🏁 Script executed:

# Check the other routing functions to see their patterns
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '84,145p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 3079


🏁 Script executed:

# Let's check the docstring for sigmoid_topk_routing to see what it documents as expected parameters
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '191,216p'

Repository: axolotl-ai-cloud/axolotl

Length of output: 1425


🏁 Script executed:

# Let's look at the patch.py to see how the routing function is called with the actual moe_block
cat -n src/axolotl/integrations/kernels/sonicmoe/patch.py | head -150

Repository: axolotl-ai-cloud/axolotl

Length of output: 6967


🏁 Script executed:

# Check if there's any model-specific handling or initialization for Nemotron-H
rg "nemotron" -B 10 -A 10 --type py src/

Repository: axolotl-ai-cloud/axolotl

Length of output: 5529


Add fallback logic for top_k and topk_group in sigmoid_topk_routing.

Lines 219 and 247 access moe_block.top_k and moe_block.topk_group directly without fallback, while other optional parameters in the same function (lines 263, 266) use getattr() with defaults. The function docstring marks these as optional, yet the code enforces them on moe_block only. Use getattr() with gate as secondary source to match the pattern established for e_score_correction_bias (lines 229-231):

Suggested fix
 def sigmoid_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     gate = moe_block.gate
     T, H = hidden_states.shape
-    K = moe_block.top_k
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            "sigmoid_topk_routing requires top_k on moe_block or gate, "
+            "but neither has it"
+        )
     E = getattr(
         moe_block,
         "n_routed_experts",
         getattr(gate, "n_routed_experts", gate.weight.shape[0]),
     )
     n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
     
     # ... (lines 223-246 unchanged) ...
     
     if n_group > 1:
         # ... (lines 241-245 unchanged) ...
-        group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
-        )[1]
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
elif model_type in ("nemotron_h",):
return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
def sigmoid_topk_routing(
hidden_states: torch.Tensor, moe_block
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
gate = moe_block.gate
T, H = hidden_states.shape
K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
if K is None:
raise AttributeError(
"sigmoid_topk_routing requires top_k on moe_block or gate, "
"but neither has it"
)
E = getattr(
moe_block,
"n_routed_experts",
getattr(gate, "n_routed_experts", gate.weight.shape[0]),
)
n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
# ... rest of function with change at topk_group access ...
if n_group > 1:
# ...
topk_group = getattr(
moe_block, "topk_group", getattr(gate, "topk_group", n_group)
)
group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/kernels/sonicmoe/routing.py` around lines 60 - 62,
sigmoid_topk_routing currently accesses moe_block.top_k and moe_block.topk_group
directly which can be missing; change those accesses to use getattr(moe_block,
"top_k", getattr(gate, "top_k", <default>)) and getattr(moe_block, "topk_group",
getattr(gate, "topk_group", <default>)) so they follow the existing fallback
pattern used for e_score_correction_bias and other optional params; update the
references in sigmoid_topk_routing to use these getattr calls (and pick the same
sensible defaults used elsewhere in the function) so missing attributes on
moe_block fall back to gate before defaulting.
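
Unlike the optional grouping fields, `top_k` has no sensible default, so its lookup should fail loudly when neither object defines it. One way to sketch that uses a sentinel, so that a legitimately stored `None` is not confused with a missing attribute. The helper name `require_attr` and the stub objects are hypothetical and not part of the codebase under review.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel: distinguishes "absent" from a stored None


def require_attr(name, *sources):
    """Return `name` from the first source that defines it, else raise."""
    for src in sources:
        value = getattr(src, name, _MISSING)
        if value is not _MISSING:
            return value
    kinds = ", ".join(type(s).__name__ for s in sources)
    raise AttributeError(f"{name!r} not found on any of: {kinds}")


# Stand-in for a block whose top_k lives on the router only.
gate = SimpleNamespace(top_k=6)
block = SimpleNamespace(gate=gate)

print(require_attr("top_k", block, gate))  # 6
try:
    require_attr("topk_group", block, gate)
except AttributeError as exc:
    print(exc)
```

The sources are searched in the order given, so passing the block before the gate keeps block-level values authoritative when both define the field.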

# elif model_type in ("deepseek_v2",):
# # Softmax→topk with group_limited_greedy. Different attr names: num_group
# # (not n_group), gate is nn.Linear (not a router class).
# return ..., ActivationType.SWIGLU, "gate"
elif model_type in ("ernie4_5_moe",):
return softmax_bias_topk_routing, ActivationType.SWIGLU, "gate"
elif model_type in ("hunyuan_v1_moe",):