Conversation
Force-pushed from 1979e04 to 374f437.
This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded. Tracking issue: ROCm/AI-Frameworks-Dashboard#141
Squash-merged from main commit 52c4554. Includes 5 atomic Silo PRs:

- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: two blocks resolved by taking theirs (Silo). Block 1 upgrades existing M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely additive: 30+ new rows for previously uncovered N=7168/K=1024 shapes plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08; Silo customers need these tunings on the AITER release wheel, not nightly.

Verification gate before tag:

- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference, confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1 (a minimal tolerance-check sketch follows this comment)
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
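For the accuracy item in the gate above, a minimal sketch of what a "+/- 0.005 vs v0.1.13-rc1" check could look like, assuming per-model accuracy scores have already been collected into plain dicts; the model names and numbers below are placeholders, not measured results.

```python
# Minimal sketch of the "accuracy unchanged within +/- 0.005" gate.
# ASSUMPTION: scores are already available as {model_name: accuracy} dicts;
# the model list and values below are placeholders, not real measurements.

TOLERANCE = 0.005

def accuracy_gate(baseline: dict[str, float],
                  candidate: dict[str, float],
                  tol: float = TOLERANCE) -> list[str]:
    """Return the models whose accuracy moved by more than `tol`."""
    failures = []
    for model, base_acc in baseline.items():
        new_acc = candidate.get(model)
        if new_acc is None or abs(new_acc - base_acc) > tol:
            failures.append(model)
    return failures

if __name__ == "__main__":
    baseline = {"model_a": 0.812, "model_b": 0.774}   # e.g. v0.1.13-rc1 numbers
    candidate = {"model_a": 0.813, "model_b": 0.771}  # release-candidate numbers
    bad = accuracy_gate(baseline, candidate)
    print("PASS" if not bad else f"FAIL: {bad}")
```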
Hi @sunway513, one clarification before closing: this PR is not fully covered by the #3004 config-only bulk merge. In aiter/configs/tuned_fmoe.csv, the new MiniMax FMoE rows for token=4096 and token=8192 with model_dim=3072, inter_dim=384 reference the added 256x64x128x128 ... A8W8blkscale_v1 CK two-stage instances from csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py. Those CSV rows therefore depend on the instance additions, which were left out of #3004.
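To make that dependency concrete, here is a rough sketch of how one could check that every kernel instance referenced by a tuned CSV is actually generated. The column name `kernel_name` and the idea of collecting generated instances into a set of strings are assumptions for illustration, not the actual aiter schema.

```python
# Rough sketch: confirm every kernel instance referenced by tuned CSV rows
# exists among the generated CK two-stage instances.
# ASSUMPTIONS: the tuned CSV has a column holding the instance name (called
# "kernel_name" here) and the codegen output can be summarized as a set of
# instance-name strings; neither is taken from the real aiter layout.
import csv

def referenced_instances(tuned_csv: str, column: str = "kernel_name") -> set[str]:
    with open(tuned_csv, newline="") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}

def missing_instances(tuned_csv: str, generated: set[str]) -> set[str]:
    """Return instance names the CSV references but the codegen never emits."""
    return referenced_instances(tuned_csv) - generated

# Example: rows naming the new 256x64x128x128 A8W8blkscale_v1 instances would
# show up as missing if the codegen additions were dropped from a merge.
```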
[configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE. The build's update_config_files asserts no shape collisions across merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag); the additions in this PR introduced 65 cross-file collisions between tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):

- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better existing tunings, mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup; where their row was not the winner, the existing model-specific tuning wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
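For readers unfamiliar with the resolution, a minimal sketch of lowest-'us'-wins dedup across several tuned CSVs follows, assuming each file shares a 'us' latency column and that the shape key is a fixed list of columns; the key column names below are placeholders, not the real untuned_fmoe.csv header.

```python
# Minimal sketch of lowest-'us'-wins dedup across several tuned config CSVs.
# ASSUMPTIONS: every file has a 'us' latency column and the shape key is the
# fixed column list below (placeholder names); the real key is the
# untuned_fmoe.csv columns plus cu_num and _tag.
import csv

KEY_COLS = ["token", "model_dim", "inter_dim", "expert", "topk", "cu_num", "_tag"]

def dedup_lowest_us(csv_paths: list[str]) -> dict[tuple, tuple[str, dict]]:
    """Map shape key -> (source file, fastest row) across all input files."""
    best: dict[tuple, tuple[str, dict]] = {}
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = tuple(row[c] for c in KEY_COLS)
                if key not in best or float(row["us"]) < float(best[key][1]["us"]):
                    best[key] = (path, row)
    return best

# Rows whose shape key is won by another file are exactly the "superseded"
# counts listed per file in the commit message above.
```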
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

  Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

  GEMM (a8w8_blockscale_tuned_gemm.csv):
  - 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
  - Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
  - Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

  FMoE (tuned_fmoe.csv):
  - 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
  - Replaces 751 previous cu_num=256 entries with re-tuned results
  - Existing cu_num=80 (MI300X) entries unchanged

  Made-with: Cursor

  (Cherry-picked from #2981 to restore content lost in bulk merge #3004. Net semantic effect of this PR vs current main: GEMM +6375 / -5, FMoE +57 / -7. The remaining 1042 of #2981's 1099 textual FMoE adds are content-identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

  #2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv that share (M,N,K,cu_num,gfx) shape keys with the model-specific a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape collisions across merged config files; resolve by keeping the lowest 'us' row per shape:
  - 187 ds_v3 rows dropped (superseded by #2981's better tunings)
  - 36 #2981 rows dropped (ds_v3's existing tunings were faster)

  Result: every conflicting shape still has a tuning, picked from whichever file had the better measurement.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

  Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE. The build's update_config_files asserts no shape collisions across merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag); the additions in this PR introduced 65 cross-file collisions between tuned_fmoe.csv and 4 model_configs files.

  Resolution (best 'us' per shape):
  - tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better existing tunings, mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
  - a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
  - a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
  - a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
  - glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

  Every shape contributed by #2981 and #2982 remains covered post-dedup; where their row was not the winner, the existing model-specific tuning wins on its own merits.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config

  Keep the surviving MiniMax M25 FMoE rows in the model-specific config file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs

  Keep the surviving DeepSeek V3.2 tuning rows in model-specific config files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

  Address likely MiniMax tuning contamination by moving the per-token GLM-4.7 FMoE entries back to the GLM model config where they belong.
* [moe] Disable blockscale GEMM2 instance that exceeds LDS

  Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2 candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
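As a purely illustrative back-of-envelope for why a large block tile can blow the LDS budget: the reading of 256x128x128x128 as M/N/K block sizes, the 1-byte fp8 operand size, double-buffered staging, and the 64 KiB per-workgroup LDS figure below are all assumptions, not taken from the CK instance definition.

```python
# Purely illustrative LDS budget estimate; every number below is an assumption,
# not read from the actual kernel instance.
M_PER_BLOCK, N_PER_BLOCK, K_PER_BLOCK = 256, 128, 128  # assumed tile interpretation
BYTES_PER_ELEM = 1        # fp8 / int8 operands
STAGES = 2                # assumed double-buffered LDS staging
LDS_CAPACITY = 64 * 1024  # typical per-workgroup LDS on CDNA parts

a_tile = M_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM
b_tile = N_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM
lds_needed = STAGES * (a_tile + b_tile)

print(f"estimated LDS: {lds_needed // 1024} KiB vs {LDS_CAPACITY // 1024} KiB")
# -> 96 KiB vs 64 KiB under these assumptions, i.e. such a candidate cannot be
#    prebuilt and is better disabled up front than left to fail at JIT time.
```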
Merged with #3024.
Adds MiniMax M25 FMoE tuning entries and keeps the tuning table deduplicated and sorted by token