[MoE] Add LFM2 MoE tuning support + tuned configs for H100/B200/MI325X #22791
Conversation
LFM2 MoE models (LiquidAI/LFM2-8B-A1B, LiquidAI/LFM2-24B-A2B) use num_experts / moe_intermediate_size config keys. The default Mixtral fallback expects num_local_experts / intermediate_size, so tuning either crashes or produces wrong kernel shapes. This adds an explicit branch for Lfm2MoeForCausalLM that reads the correct config fields.
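For context, a quick way to see the key mismatch directly from the model config (model ID from this PR; needs a transformers release that knows the `lfm2_moe` architecture and network access to the Hub):

```python
from transformers import AutoConfig

# LFM2 MoE exposes num_experts / moe_intermediate_size rather than the
# Mixtral-style num_local_experts / intermediate_size the fallback expects.
cfg = AutoConfig.from_pretrained("LiquidAI/LFM2-8B-A1B")
print(cfg.num_experts, cfg.num_experts_per_tok, cfg.moe_intermediate_size)
print(hasattr(cfg, "num_local_experts"))  # the Mixtral-style key the fallback reads
```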
Tuned fused MoE triton kernel configs for LFM2-8B-A1B (E=32) and LFM2-24B-A2B (E=64) at TP=1,2,4,8 on NVIDIA H100 80GB HBM3 and B200. Generated via tuning_fused_moe_triton.py grid search over 1920 kernel configs per batch size (1-4096 + 8192). Configs auto-load at inference via device_name matching. Delivers up to +47% throughput over default kernel configs at high concurrency (tp=1 benefits most; tp=8 shows minimal gains as shards approach existing config sweet spots).
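As a rough illustration of the device_name matching, assuming the usual `E=<num experts>,N=<per-rank intermediate size>,device_name=<gpu>.json` naming used for fused MoE triton configs (the helper below is hypothetical, not sglang's actual loader):

```python
import torch

def guess_config_filename(E: int, N: int) -> str:
    # Hypothetical helper: build the filename the runtime loader would look for.
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    return f"E={E},N={N},device_name={device_name}.json"

# On an H100 this resolves to something like
# "E=64,N=...,device_name=NVIDIA_H100_80GB_HBM3.json"
# (E=64 is LFM2-24B-A2B; N depends on the TP shard).
```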
Tuned fused MoE triton kernel configs for LFM2-8B-A1B and LFM2-24B-A2B at TP=1,2,4,8 on AMD Instinct MI325X. Generated via tuning_fused_moe_triton.py inside the v0.5.10 ROCm container which ships triton 3.6.0 (hence the separate directory from the NVIDIA configs in triton_3_5_1/). Note: On AMD, sglang routes MoE through aiter CK-MoE by default, which does not use these triton configs. The configs take effect only when --moe-runner-backend triton is set explicitly (e.g. for LoRA workloads where aiter CK-MoE is unavailable).
Code Review
This pull request adds support for the Lfm2MoeForCausalLM architecture in the MoE Triton kernel benchmark and introduces a comprehensive set of Triton kernel configurations for NVIDIA B200, H100, and AMD MI325X GPUs. The review feedback suggests refactoring the architecture configuration logic to reduce code duplication and ensuring that the new JSON files include a trailing newline for consistency.
```python
elif architecture == "Lfm2MoeForCausalLM":
    E = config.num_experts // ep_size
    topk = config.num_experts_per_tok
    intermediate_size = config.moe_intermediate_size
```
This elif block is identical to the logic for other architectures like BailingMoEForCausalLM (lines 124-131) and Qwen2MoeForCausalLM (lines 73-82). To avoid code duplication and improve maintainability, consider adding Lfm2MoeForCausalLM to one of the existing lists of architectures that share this logic.
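A minimal sketch of the suggested refactor (the list name and surrounding code are illustrative, not the actual `common_utils.py` structure):

```python
# Architectures that share the num_experts / moe_intermediate_size key layout.
ARCHS_WITH_MOE_KEYS = (
    "Qwen2MoeForCausalLM",
    "BailingMoEForCausalLM",
    "Lfm2MoeForCausalLM",  # added by this PR
)

def resolve_moe_shape(config, ep_size: int):
    if config.architectures[0] in ARCHS_WITH_MOE_KEYS:
        E = config.num_experts // ep_size
        topk = config.num_experts_per_tok
        intermediate_size = config.moe_intermediate_size
    else:  # Mixtral-style fallback
        E = config.num_local_experts // ep_size
        topk = config.num_experts_per_tok
        intermediate_size = config.intermediate_size
    return E, topk, intermediate_size
```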
| "num_stages": 2, | ||
| "waves_per_eu": 0 | ||
| } | ||
| } No newline at end of file |
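For reference, each batch-size entry in these JSON files carries a full triton launch config. Shown here as an equivalent Python dict with illustrative values (only `num_stages` and `waves_per_eu` appear in the diff above; the other field names follow the fused MoE triton kernel's parameters):

```python
# One batch-size bucket from a tuned config, as a Python dict (values illustrative).
example_entry = {
    "64": {                  # token count (M) this entry was tuned for
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 128,
        "BLOCK_SIZE_K": 128,
        "GROUP_SIZE_M": 16,
        "num_warps": 8,
        "num_stages": 2,
        "waves_per_eu": 0,   # ROCm-specific tuning knob, as in the diff above
    }
}
```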
The question is: are the triton directories correct, and how could this stay future-proof when the triton version upgrades?
@tugot17 Thanks for your support! When triton upgrades, it will hit the newest version available. Also please fix lint.
@Fridge003 linting now passes
Could we merge it now?
[MoE] Add LFM2 MoE tuning support + tuned configs for H100/B200/MI325X (sgl-project#22791) Co-authored-by: Piotr Mazurek <piotr.mazurek@liquid.ai>
After sgl-project#23019 moved the MoE config loader and the configs/ tree from `fused_moe_triton/` to `moe_runner/triton_utils/`, two later PRs unknowingly added 33 tuned-config JSONs to the OLD path:

- sgl-project#22791 (LFM2) — 24 files (E=32/64, H100/B200/MI325X)
- sgl-project#23533 (Hy3 preview) — 9 files (E=192, N=192 incl. _down, H20/H20-3e/B200)

The runtime loader anchors its search via os.path.dirname(os.path.realpath(__file__)) of the loader file (now in moe_runner/triton_utils/), so configs in the old directory were never read — runtime fell back to get_default_config(). The configs themselves were properly tuned and benchmarked at submission time via the in-process override_config() path used by the tuning script — that is why the PR authors observed real speedup. The bug is purely a wrong filesystem location. Root cause: the tuning README still pointed contributors to the old path.

This PR moves the misplaced configs into the runtime-loaded location and fixes the README. Changes:

* R100 git-mv 33 JSONs into moe_runner/triton_utils/configs/{triton_3_5_1,triton_3_6_0}/
* Update benchmark/kernels/fused_moe_triton/README.md path

No content changes. No code changes.

References: sgl-project#23019 sgl-project#22791 sgl-project#23533
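The path-anchoring behavior described above, as a small sketch (directory names follow this PR's layout; the snippet is illustrative, not the actual loader code):

```python
import os

# The loader resolves its config directory relative to its own source file,
# so only JSONs living under moe_runner/triton_utils/configs/ are ever found.
loader_dir = os.path.dirname(os.path.realpath(__file__))
config_dir = os.path.join(loader_dir, "configs", "triton_3_5_1")
# JSONs left under the old fused_moe_triton/ tree sit outside config_dir and
# are silently ignored; the kernel then falls back to get_default_config().
```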
Summary
Adds `Lfm2MoeForCausalLM` to the MoE tuning script and ships tuned fused MoE triton kernel configs for LFM2-8B-A1B and LFM2-24B-A2B at TP=1,2,4,8 on NVIDIA H100, B200, and AMD Instinct MI325X. Up to +47% throughput over default configs at high concurrency on NVIDIA.

Motivation
LFM2 MoE models (`LiquidAI/LFM2-8B-A1B`, `LiquidAI/LFM2-24B-A2B`) use `num_experts` / `moe_intermediate_size` config keys. The default Mixtral fallback in `common_utils.py` expects `num_local_experts` / `intermediate_size`, so tuning either crashes or produces wrong kernel shapes.

Without tuned configs, the fused MoE triton kernel falls back to generic defaults that are far from optimal for LFM2's expert shapes. Per-rank shard sizes depend on both the model and the TP degree, as sketched below.
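Each TP rank holds `moe_intermediate_size // tp_size` of every expert, so the N dimension the kernel is tuned for shrinks as TP grows. A minimal sketch (the size below is a placeholder, not LFM2's actual config value):

```python
def per_rank_intermediate(moe_intermediate_size: int, tp_size: int) -> int:
    # Each TP rank holds a 1/tp_size slice of every expert's up/gate projection.
    assert moe_intermediate_size % tp_size == 0
    return moe_intermediate_size // tp_size

for tp in (1, 2, 4, 8):
    print(tp, per_rank_intermediate(4096, tp))  # 4096 is a placeholder size
```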
Changes
- `benchmark/kernels/fused_moe_triton/common_utils.py` — 4-line branch for `Lfm2MoeForCausalLM` that reads `num_experts` / `moe_intermediate_size`
- `python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/` — tuned configs covering H100 and B200 × 8B/24B × TP=1,2,4,8
- `python/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_6_0/` — tuned configs covering MI325X × 8B/24B × TP=1,2,4,8 (AMD environment ships triton 3.6.0)
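To see which launch parameters a given batch size would pick up from one of these files, a small illustration (the nearest-batch-size selection is assumed to mirror how the runtime chooses a launch config; the entries shown are a tiny invented subset, not values from the shipped JSONs):

```python
# Illustrative subset of a tuned config, keyed by token count (M).
configs = {
    "1":  {"BLOCK_SIZE_M": 16, "num_warps": 4},
    "64": {"BLOCK_SIZE_M": 64, "num_warps": 8},
}

def pick_config(configs: dict, M: int) -> dict:
    # Choose the entry whose batch-size key is closest to the live M.
    key = min(configs, key=lambda k: abs(int(k) - M))
    return configs[key]

print(pick_config(configs, 48))  # falls into the "64" bucket
```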
Peak output throughput at concurrency=8192, scenario
D(1024,256)sustained. Server flags:--enable-torch-compile --cuda-graph-max-bs 8192 --disable-radix-cache --mem-fraction-static 0.80 --dtype bfloat16.H100 80GB HBM3
B200
MI325X
On AMD, sglang's default `moe_runner_backend='auto'` routes through aiter CK-MoE, which is faster than the triton fused-MoE and does not use these triton configs. The triton path is only active with `--moe-runner-backend triton`, which is required for LoRA serving on AMD (aiter CK-MoE does not support LoRA — see `python/sglang/srt/layers/moe/moe_runner/runner.py`).

With default triton MoE configs the triton path is ~40% slower than aiter. Our tuned configs close that gap almost entirely — at TP=1 the tuned triton path lands within 5–7% of aiter on both models.
Takeaway: AMD users running default inference see no change (aiter still wins). LoRA-on-AMD users — who are forced onto the triton path — get a big speedup and near-aiter performance.
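A rough sketch of when the tuned triton configs actually come into play on each platform (a simplification, not the actual `runner.py` logic):

```python
def effective_moe_path(requested_backend: str, is_rocm: bool) -> str:
    # 'auto' keeps aiter CK-MoE on ROCm (fast, but ignores the triton configs);
    # an explicit 'triton' request -- needed for LoRA serving on AMD -- uses them.
    if requested_backend == "auto":
        return "aiter_ck_moe" if is_rocm else "fused_moe_triton"
    return "fused_moe_triton" if requested_backend == "triton" else requested_backend
```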
Verification
Independently verified on a separate H100 node (sglang-from-source install, triton 3.5.1).
Notes
Triton versions
Configs are split across two directories based on the triton version they were tuned against:
- `triton_3_5_1/` — NVIDIA (H100, B200). This is what `torch==2.9.1` installs from PyPI on bare-metal Linux x86_64, which matches upstream sglang's default pin. If a future torch bump pulls a different triton version, these configs would need to be retuned against that new version.
- `triton_3_6_0/` — AMD (MI325X). Triton 3.6.0 is what ships inside the `lmsysorg/sglang:v0.5.10-rocm720-mi30x` container. It also happens to be what `lmsysorg/sglang:v0.5.10-cu124` ships on the NVIDIA container side, but we did not retune NVIDIA configs under 3.6.0 for this PR — follow-up work if containerized NVIDIA deployments need optimal kernels.
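A sketch of how the per-version directory could be derived from the installed triton (directory names from this PR; the selection logic shown is an assumption, not the actual loader code):

```python
import triton

# Map the installed triton version to the matching config subdirectory,
# e.g. 3.5.1 -> "triton_3_5_1", 3.6.0 -> "triton_3_6_0".
subdir = "triton_" + triton.__version__.replace(".", "_")
print(subdir)
```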
Out of scope

- `_down.json` configs — sglang v0.5.10 supports separate configs for the w2 down-projection via `tuning_fused_moe_triton_sep.py` (which needs pre-generated topk_ids from a running server). Not tuned here; down-proj falls back to defaults and logs a benign `down_moe=False` warning at startup. Reported gains are measured with down-proj at defaults, so adding `_down.json` can only improve these numbers further.