feat: add moe kernel support for non-glu #3558
base: main
Changes from all commits
c8d05de
2df8af8
3182d00
f70b6af
60215e8
@@ -59,6 +59,13 @@ def get_model_moe_config(model_type: str):
         "minimax_m2",
     ):
         return sigmoid_topk_routing, ActivationType.SWIGLU, "gate"
+    # Non-GLU MoE (no gate_proj, experts use up_proj + down_proj only)
+    elif model_type in ("nemotron_h",):
+        return sigmoid_topk_routing, ActivationType.RELU_SQ, "gate"
Comment on lines +62 to +64
Contributor
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the routing.py file
find . -name "routing.py" -type f | head -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 122
🏁 Script executed:
# Also check the sonicmoe directory structure
find . -path "*/sonicmoe*" -type f | head -30
Repository: axolotl-ai-cloud/axolotl
Length of output: 295
🏁 Script executed:
# Let's look at the actual code at lines 60-62
head -70 src/axolotl/integrations/kernels/sonicmoe/routing.py | tail -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 955
🏁 Script executed:
# Let's look at the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | head -250
Repository: axolotl-ai-cloud/axolotl
Length of output: 12192
🏁 Script executed:
# Continue reading the sigmoid_topk_routing function
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '250,300p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 1679
🏁 Script executed:
# Let's search for Nemotron-H model files in the codebase
find . -type f -name "*nemotron*" | head -20
Repository: axolotl-ai-cloud/axolotl
Length of output: 115
🏁 Script executed:
# Search for any references to NemotronH in the codebase
rg -i "nemotron" --type py -A 5 -B 5 | head -100
Repository: axolotl-ai-cloud/axolotl
Length of output: 8785
🏁 Script executed:
# Let's check what Nemotron-H MoE structure looks like in the example config
cat examples/nemotron/nemotron-mini-4b-qlora.yaml | head -50
Repository: axolotl-ai-cloud/axolotl
Length of output: 877
🏁 Script executed:
# Now let's search for how sigmoid_topk_routing is currently being used or tested
rg "sigmoid_topk_routing" -A 10 -B 5 --type py
Repository: axolotl-ai-cloud/axolotl
Length of output: 29558
🏁 Script executed:
# Let's check the test helpers to understand how moe_block is structured
cat -n tests/integrations/test_sonicmoe.py | head -200
Repository: axolotl-ai-cloud/axolotl
Length of output: 8887
🏁 Script executed:
# Search for _make_glm_moe_block and _make_minimax_m2_moe_block to understand the block structure
rg "_make_glm_moe_block|_make_minimax_m2_moe_block" -A 20 --type py tests/
Repository: axolotl-ai-cloud/axolotl
Length of output: 9840
🌐 Web query:
💡 Result:
The Nemotron-H models are hybrid Mamba-Transformer architectures without MoE layers, using standard MLP (FFN) layers instead. They do not have a top_k gate router. Nemotron 3 Nano (30B-A3B), built on Nemotron-H, introduces sparse MoE layers replacing FFNs. It uses a learned MLP router (sigmoid gating) that activates the top 6 out of 128 routable experts per token, plus 2 shared experts, for ~3.6B active parameters per token from 31.6B total. Hugging Face NemotronHConfig supports MoE via layers_block_type including "moe", with num_experts_per_tok (default 2, top-k routing parameter), n_routed_experts (default 8), topk_group (default 1), and norm_topk_prob for normalizing top-k weights. Specific configs for Nemotron-H models lack MoE params as they use dense MLPs; MoE appears in Nemotron 3 models like Nano.
🏁 Script executed:
# Also check if there are any local references or imports from the transformers library
rg "NemotronH|nemotron_h" -B 5 -A 10 --type py | grep -i "top_k\|n_group\|router\|gate"
Repository: axolotl-ai-cloud/axolotl
Length of output: 1163
🏁 Script executed:
# Let's check if there are any actual Nemotron-H MoE model files or tests
find . -type f \( -name "*.py" -o -name "*.yaml" \) -exec grep -l "nemotron" {} \;
Repository: axolotl-ai-cloud/axolotl
Length of output: 257
🏁 Script executed:
# Search for where top_k is defined or accessed in the routing code for other models
rg "top_k" -B 2 -A 2 src/axolotl/integrations/kernels/sonicmoe/routing.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1169
🏁 Script executed:
# Let's verify the actual parameter access patterns more carefully
# Check what parameters sigmoid_topk_routing expects and doesn't have fallbacks for
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '217,270p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 2926
🏁 Script executed:
# Check the other routing functions to see their patterns
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '84,145p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 3079
🏁 Script executed:
# Let's check the docstring for sigmoid_topk_routing to see what it documents as expected parameters
cat -n src/axolotl/integrations/kernels/sonicmoe/routing.py | sed -n '191,216p'
Repository: axolotl-ai-cloud/axolotl
Length of output: 1425
🏁 Script executed:
# Let's look at the patch.py to see how the routing function is called with the actual moe_block
cat -n src/axolotl/integrations/kernels/sonicmoe/patch.py | head -150
Repository: axolotl-ai-cloud/axolotl
Length of output: 6967
🏁 Script executed:
# Check if there's any model-specific handling or initialization for Nemotron-H
rg "nemotron" -B 10 -A 10 --type py src/
Repository: axolotl-ai-cloud/axolotl
Length of output: 5529
Add fallback logic for the `moe_block.top_k` and `moe_block.topk_group` accesses at lines 219 and 247
Suggested fix
 def sigmoid_topk_routing(
     hidden_states: torch.Tensor, moe_block
 ) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]:
     gate = moe_block.gate
     T, H = hidden_states.shape
-    K = moe_block.top_k
+    K = getattr(moe_block, "top_k", getattr(gate, "top_k", None))
+    if K is None:
+        raise AttributeError(
+            f"sigmoid_topk_routing requires top_k on moe_block or gate, "
+            f"but neither has it"
+        )
     E = getattr(
         moe_block,
         "n_routed_experts",
         getattr(gate, "n_routed_experts", gate.weight.shape[0]),
     )
     n_group = getattr(moe_block, "n_group", getattr(gate, "n_group", 1))
     # ... (lines 223-246 unchanged) ...
     if n_group > 1:
         # ... (lines 241-245 unchanged) ...
-        group_idx = torch.topk(
-            group_scores, k=moe_block.topk_group, dim=-1, sorted=False
-        )[1]
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(gate, "topk_group", n_group)
+        )
+        group_idx = torch.topk(group_scores, k=topk_group, dim=-1, sorted=False)[1]
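The nested `getattr` pattern in the suggested fix can be tried out in isolation. The sketch below is only an illustration: `resolve_attr` is a hypothetical helper (it is not part of axolotl), and the `SimpleNamespace` objects are stand-ins for a MoE block whose router, rather than the block itself, carries `top_k`.

```python
from types import SimpleNamespace

_MISSING = object()  # sentinel so that None can still be a legitimate value


def resolve_attr(name, *owners, default=_MISSING):
    """Return the first `name` found on `owners`, searched in order.

    Hypothetical helper: raises AttributeError when no owner has the
    attribute and no default was supplied.
    """
    for owner in owners:
        value = getattr(owner, name, _MISSING)
        if value is not _MISSING:
            return value
    if default is _MISSING:
        raise AttributeError(f"{name!r} not found on any of {len(owners)} objects")
    return default


# Stand-in objects (assumed layout): the router owns top_k, the block does not.
gate = SimpleNamespace(top_k=6, n_group=1)
moe_block = SimpleNamespace(gate=gate)

# Block first, then gate, mirroring the lookup order in the suggested fix.
top_k = resolve_attr("top_k", moe_block, gate)

# topk_group is on neither object, so it falls back to n_group.
topk_group = resolve_attr(
    "topk_group",
    moe_block,
    gate,
    default=resolve_attr("n_group", moe_block, gate, default=1),
)
```

With this layout a direct `moe_block.top_k` access would raise `AttributeError`, which is the failure the review comment is guarding against.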
+    # elif model_type in ("deepseek_v2",):
+    #     # Softmax→topk with group_limited_greedy. Different attr names: num_group
+    #     # (not n_group), gate is nn.Linear (not a router class).
+    #     return ..., ActivationType.SWIGLU, "gate"
     elif model_type in ("ernie4_5_moe",):
         return softmax_bias_topk_routing, ActivationType.SWIGLU, "gate"
     elif model_type in ("hunyuan_v1_moe",):
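For readers unfamiliar with the distinction the `nemotron_h` branch draws, here is a minimal pure-Python sketch of a non-GLU expert next to a SwiGLU expert. It assumes `ActivationType.RELU_SQ` denotes squared ReLU; the function names and the tiny list-based matrices are illustrative only and do not reflect the kernel implementation.

```python
import math


def matvec(weight, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu_sq(v):
    """Squared ReLU: max(v, 0)^2 (assumed meaning of RELU_SQ)."""
    return max(v, 0.0) ** 2


def silu(v):
    """SiLU / swish activation used by SwiGLU."""
    return v / (1.0 + math.exp(-v))


def non_glu_expert(x, up_w, down_w):
    """Non-GLU expert: down_proj(act(up_proj(x))), with no gate_proj."""
    hidden = [relu_sq(h) for h in matvec(up_w, x)]
    return matvec(down_w, hidden)


def swiglu_expert(x, gate_w, up_w, down_w):
    """GLU expert: down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gated = matvec(gate_w, x)
    up = matvec(up_w, x)
    hidden = [silu(g) * u for g, u in zip(gated, up)]
    return matvec(down_w, hidden)


# Toy weights: 2-d input, 2-d hidden, 1-d output.
x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
down = [[1.0, 1.0]]

non_glu_out = non_glu_expert(x, identity, down)
swiglu_out = swiglu_expert(x, identity, identity, down)
```

The non-GLU path needs one fewer weight matrix per expert, which is why the hunk comment notes that these experts use `up_proj` and `down_proj` only.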
🧩 Analysis chain
🏁 Script executed:
Repository: axolotl-ai-cloud/axolotl
Length of output: 1190
🏁 Script executed:
cat -n src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py | head -50
Repository: axolotl-ai-cloud/axolotl
Length of output: 2481
🏁 Script executed:
Repository: axolotl-ai-cloud/axolotl
Length of output: 135
🏁 Script executed:
sed -n '280,320p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1788
🏁 Script executed:
sed -n '494,510p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 1041
🏁 Script executed:
grep -n "_sigmoid_topk_route" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 136
🏁 Script executed:
sed -n '253,330p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 3436
🏁 Script executed:
sed -n '253,250p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 92
🏁 Script executed:
sed -n '253,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 3774
🏁 Script executed:
sed -n '195,215p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 972
🏁 Script executed:
sed -n '330,340p' src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 411
🏁 Script executed:
rg -A 10 "def _unwrap_experts_lora" src/axolotl/integrations/kernels/libs/scattermoe_lora/layers.py
Repository: axolotl-ai-cloud/axolotl
Length of output: 530
🌐 Web query:
Nemotron-H NemotronHMoE NemotronHTopkRouter router fields n_group topk_group site:huggingface.co
💡 Result:
Nemotron-H (NemotronH) is a hybrid Mamba-Transformer architecture from NVIDIA that includes Mixture-of-Experts (MoE) layers. The MoE uses a top-k router called NemotronHTopkRouter. Key router fields from NemotronHConfig:
- n_group (int, default 1): Number of groups for expert routing. Experts are partitioned into this many groups.
- topk_group (int, default 1): Number of selected groups for each token. Ensures selected experts are only within topk_group groups, implementing grouped top-k selection to reduce communication or improve efficiency.
- Related: num_experts_per_tok (int, default 2): Top-k value, number of experts routed per token.
- n_routed_experts (int, default 8): Total number of routed experts.
The NemotronHTopkRouter computes routing logits (linear projection + sigmoid), selects top-K experts using grouped strategy (first top topk_group groups, then top num_experts_per_tok experts within), gathers weights, optionally normalizes, and applies scaling. This supports load balancing and group-based selection for hardware efficiency in distributed setups.
🏁 Script executed:
Repository: axolotl-ai-cloud/axolotl
Length of output: 2928
🏁 Script executed:
Repository: axolotl-ai-cloud/axolotl
Length of output: 1235
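The grouped sigmoid top-k routing described in the web result above can be sketched for a single token in plain Python. This is a simplified illustration, not the upstream router: the function name is hypothetical, and scoring each group by its maximum expert score is an assumption (the source does not say how groups are ranked).

```python
import math


def sigmoid_grouped_topk(
    logits,
    top_k,
    n_group=1,
    topk_group=1,
    norm_topk_prob=True,
    routed_scaling_factor=1.0,
):
    """Pick `top_k` experts for one token, restricted to the best groups."""
    n_experts = len(logits)
    if n_experts % n_group != 0:
        raise ValueError("number of experts must be divisible by n_group")

    # Sigmoid gating: each expert gets an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]

    allowed = list(range(n_experts))
    if n_group > 1:
        size = n_experts // n_group
        # Assumption: a group's score is the best expert score inside it.
        group_scores = [max(scores[g * size:(g + 1) * size]) for g in range(n_group)]
        best_groups = sorted(
            range(n_group), key=lambda g: group_scores[g], reverse=True
        )[:topk_group]
        allowed = [
            e for g in sorted(best_groups) for e in range(g * size, (g + 1) * size)
        ]

    # Top-k over the experts that survived group selection.
    chosen = sorted(allowed, key=lambda e: scores[e], reverse=True)[:top_k]
    weights = [scores[e] for e in chosen]

    if norm_topk_prob:
        total = sum(weights) + 1e-20
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Eight experts in four groups of two; keep two groups, then two experts.
logits = [2.0, 1.0, 0.0, -1.0, 3.0, -3.0, 0.5, 0.2]
chosen, weights = sigmoid_grouped_topk(logits, top_k=2, n_group=4, topk_group=2)
```

If `n_group` is read as 1 because the field lives on the router rather than the block, the group filter is skipped entirely, which is the behavioural difference the following comment is concerned with.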
Nemotron-H ScatterMoE routing still misses gate-side router fields.
These expert-side changes enable the non-GLU MLP path, but `_sigmoid_topk_route()` only falls back to `base_gate` for `top_k`. Upstream `NemotronHTopkRouter` also stores `n_group`, `topk_group`, `norm_topk_prob`, and `routed_scaling_factor` on the router, so Lines 288-316 will skip grouped selection and use default scaling for Nemotron-H unless those fields also fall back to `base_gate`.
Suggested fix
 def _sigmoid_topk_route(
     moe_block, base_gate, hidden_states, gate_weight, gate_lora_delta
 ):
@@
-    num_experts = getattr(moe_block, "n_routed_experts", gate_weight.shape[0])
+    num_experts = getattr(
+        moe_block,
+        "n_routed_experts",
+        getattr(base_gate, "n_routed_experts", gate_weight.shape[0]),
+    )
@@
-    n_group = getattr(moe_block, "n_group", 1)
+    n_group = getattr(moe_block, "n_group", getattr(base_gate, "n_group", 1))
     if n_group > 1:
@@
-        topk_group = getattr(moe_block, "topk_group", n_group)
+        topk_group = getattr(
+            moe_block, "topk_group", getattr(base_gate, "topk_group", n_group)
+        )
@@
-    if getattr(moe_block, "norm_topk_prob", True):
+    if getattr(
+        moe_block, "norm_topk_prob", getattr(base_gate, "norm_topk_prob", True)
+    ):
         topk_weights = topk_weights / (topk_weights.sum(dim=-1, keepdim=True) + 1e-20)
-    routed_scaling_factor = getattr(moe_block, "routed_scaling_factor", 1.0)
+    routed_scaling_factor = getattr(
+        moe_block,
+        "routed_scaling_factor",
+        getattr(base_gate, "routed_scaling_factor", 1.0),
+    )
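To make the silent-default failure mode concrete, the sketch below compares a block-only lookup with a block-then-gate lookup. `RouterParams` and `read_router_params` are hypothetical names introduced for illustration, and the field values on the stand-in router are made up; only the field names come from the comment above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class RouterParams:
    n_group: int
    topk_group: int
    norm_topk_prob: bool
    routed_scaling_factor: float


def read_router_params(moe_block, base_gate):
    """Read router fields from the block first, then the gate, then a default."""

    def pick(name, default):
        return getattr(moe_block, name, getattr(base_gate, name, default))

    n_group = pick("n_group", 1)
    return RouterParams(
        n_group=n_group,
        topk_group=pick("topk_group", n_group),
        norm_topk_prob=pick("norm_topk_prob", True),
        routed_scaling_factor=pick("routed_scaling_factor", 1.0),
    )


# Stand-in layout (illustrative values): every routing field lives on the router.
router = SimpleNamespace(
    n_group=8, topk_group=4, norm_topk_prob=True, routed_scaling_factor=2.5
)
block = SimpleNamespace(gate=router)

# Block-only lookup: no error, but the router's n_group is silently ignored.
block_only_n_group = getattr(block, "n_group", 1)

# Two-level lookup picks up the router's values.
params = read_router_params(block, router)
```

Unlike the missing `top_k` case, nothing raises here; the block-only lookup just returns the default, so the routing would quietly differ from the upstream model.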