
[Silo] Bulk merge: tuned GEMM and FMoE configs (GLM-4.7, Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.2)#3004

Merged
sunway513 merged 13 commits into main from silo/v0.1.13-configs
May 2, 2026

Conversation

@sunway513
Collaborator

Summary

Bulk merge of 5 Silo config-only PRs for v0.1.13 deadline (vLLM 0.21 freeze 2026-05-08).

Pure CSV tuning configs — no code changes.

Included PRs

  • #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
  • #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
  • #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
  • #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
  • #2982 MiniMax-M2.5 FMoE tunings

Files Changed

All under aiter/configs/:

  • a8w8_blockscale_tuned_gemm.csv
  • tuned_fmoe.csv
  • model_configs/glm47_fp8_tuned_fmoe.csv (new)
  • model_configs/glm47_fp8_untuned_fmoe.csv (new)
  • model_configs/kimik2_fp4_tuned_fmoe.csv

Risk

Low — CSV config files only. No kernel code, no API changes, no build changes.

Test Plan

  • CI passes (no code changes, config load only)
  • Spot check: ATOM loads new configs without error
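
For illustration only, a config-load spot check along the lines of the test plan above could look like the following sketch. This is hypothetical tooling, not part of the PR; it only assumes the listed CSVs sit under aiter/configs/ in a checkout.

```python
# Hypothetical spot check (not part of this PR): confirm the new/updated CSVs
# parse cleanly and keep a consistent column count per file.
import csv
from pathlib import Path

CONFIG_DIR = Path("aiter/configs")  # assumed repo-relative location
FILES = [
    "a8w8_blockscale_tuned_gemm.csv",
    "tuned_fmoe.csv",
    "model_configs/glm47_fp8_tuned_fmoe.csv",
    "model_configs/glm47_fp8_untuned_fmoe.csv",
    "model_configs/kimik2_fp4_tuned_fmoe.csv",
]

for name in FILES:
    with (CONFIG_DIR / name).open(newline="") as f:
        rows = [r for r in csv.reader(f) if r]
    widths = {len(r) for r in rows}
    assert len(widths) == 1, f"{name}: inconsistent column count {widths}"
    assert len(rows) > 1, f"{name}: header only, no data rows"
    print(f"{name}: {len(rows) - 1} data rows, {widths.pop()} columns")
```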

omirosh and others added 10 commits April 27, 2026 12:16
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor
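
Not part of the commit: a minimal sketch of how a "best of splitK=0 vs splitK>0" pick per shape could be derived from a raw tuning sweep. The sweep file name and the M/N/K/us column names are assumptions, not the actual schema of a8w8_blockscale_tuned_gemm.csv.

```python
# Illustrative only: keep the fastest candidate per (M, N, K), whether it is a
# splitK=0 or splitK>0 config. Column names here are assumptions.
import csv

best = {}  # (M, N, K) -> fastest row seen so far
with open("raw_gemm_sweep.csv", newline="") as f:  # hypothetical tuner output
    for row in csv.DictReader(f):
        key = (int(row["M"]), int(row["N"]), int(row["K"]))
        if key not in best or float(row["us"]) < float(best[key]["us"]):
            best[key] = row

if best:
    fieldnames = list(next(iter(best.values())).keys())
    with open("a8w8_blockscale_tuned_gemm.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for key in sorted(best):
            writer.writerow(best[key])
```
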
@sunway513 sunway513 requested review from a team and Copilot May 1, 2026 23:30
@github-actions
Contributor

github-actions Bot commented May 1, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x: Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 3004 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

Bulk merge of multiple Silo “config-only” PRs, adding/updating tuned GEMM and fused-MoE (FMoE) CSV tables under aiter/configs/ (and aiter/configs/model_configs/) to improve kernel selection for specific models/hardware targets (e.g., MI355X / cu_num=256).

Changes:

  • Update model-specific tuned FMoE table for Kimi K2.5 FP4 (kimik2_fp4_tuned_fmoe.csv).
  • Add GLM-4.7 FP8 tuned + untuned FMoE shape tables (glm47_fp8_{tuned,untuned}_fmoe.csv).
  • (Per PR description) bulk-update shared tuned tables for GEMM/FMoE under aiter/configs/.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • aiter/configs/a8w8_blockscale_tuned_gemm.csv: (Per PR description) bulk tuning table updates for A8W8 blockscale GEMMs.
  • aiter/configs/tuned_fmoe.csv: (Per PR description) bulk tuning table updates for fused MoE kernel selection.
  • aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv: New GLM-4.7 FP8 tuned FMoE configs (cu_num=256).
  • aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv: New GLM-4.7 FP8 untuned reference shapes (input list for tuning).
  • aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: Updated Kimi K2.5 FP4 tuned FMoE configs (cu_num=256).
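
Aside (not from the Copilot review): since the untuned file is described as the input shape list for tuning, a rough coverage check could compare it against the tuned table. The sketch below assumes the untuned file's columns form a leading subset of the tuned file's columns, which may not match the real schema.

```python
# Assumption-based sketch: every shape row in the untuned GLM-4.7 list should
# have at least one matching entry in the tuned table.
import csv

def read_rows(path):
    with open(path, newline="") as f:
        return [r for r in csv.reader(f) if r]

untuned = read_rows("aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv")
tuned = read_rows("aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv")

n_key_cols = len(untuned[0])  # shape-key width taken from the untuned header
tuned_keys = {tuple(r[:n_key_cols]) for r in tuned[1:]}
missing = [r for r in untuned[1:] if tuple(r) not in tuned_keys]
print(f"{len(missing)} untuned shapes have no tuned entry")
```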


Comment on lines +2 to +14
256,1,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,24.9944,moe_ck2stages_gemm1_256x32x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,20.3787,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,45.3731,0,8.32,20799.41
256,2,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,40.11,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_32x128_3tg_pf3E,0.00%,28.5666,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,68.6766,0,10.99,13741.93
256,4,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,99.8134,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192E,0.00%,0,,0.00%,99.8134,1,15.13,9455.44
256,8,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,134.9101,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,134.9101,1,22.38,6996.08
256,16,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,156.6731,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,156.6731,1,38.55,6025.06
256,32,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,164.2209,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,164.2209,1,73.56,5749.63
256,64,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,168.1699,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,168.1699,1,143.66,5617.54
256,128,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,177.1016,_ZN5aiter48fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_ps_32x384E,0.00%,0,,0.00%,177.1016,1,272.83,5339.79
256,256,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,111.1445,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_64x128_2tg_pf3E,0.00%,94.3821,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,205.5266,0,470.19,4610.84
256,512,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,128.617,_ZN5aiter45fmoe_stage1_bf16_pertokenFp8_g1u1_128x128_pf3E,0.00%,136.7373,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,265.3543,0,728.36,3586.08
256,1024,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,227.447,moe_ck2stages_gemm1_256x64x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,192.093,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,419.54,0,921.36,2286.9
256,2048,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,352.8768,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,345.4655,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.70%,698.3423,0,1107.04,1396.42
256,4096,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,582.0723,moe_ck2stages_gemm1_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,641.9346,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,1224.0069,0,1263.22,822.41
@sunway513 sunway513 merged commit 52c4554 into main May 2, 2026
34 of 35 checks passed
@sunway513 sunway513 deleted the silo/v0.1.13-configs branch May 2, 2026 03:16
chun-wan pushed a commit that referenced this pull request May 4, 2026
…iniMax-M2.5, DeepSeek-V3.2) (#3004)

* Add GLM-4.7 FP8 tuned and untuned FMOE configs

* Added MI355X MoE tunings for Kimi-K2 FP4 TP2

* Add MiniMax M25 A8W8 blockscale GEMM tunings

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

* Add MiniMax M25 FMoE tunings

* fix: dedup 1692 duplicate entries in tuned_fmoe.csv from merge

* fix: remove 446 shapes from main CSV that duplicate ds_v3 model config

* fix: remove FMoE shapes that duplicate model-specific configs

---------

Co-authored-by: Olga Miroshnichenko <olga.miroshnichenko@amd.com>
Co-authored-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
@akii96
Contributor

akii96 commented May 4, 2026

Follow-up note for #3004: please see my comment on #2982 here: #2982 (comment)

The #2982 branch includes CK two-stage instance additions that were not included in this config-only bulk merge.

I was not sure if a separate PR was needed just for a few additions, so I included them in my tunings PR. Apologies, I understand now it was intended to be a pure config-only PR; let me know if you need a separate PR for the instances.

azaidy pushed a commit that referenced this pull request May 4, 2026
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)
sunway513 pushed a commit that referenced this pull request May 5, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:

- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)

Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.

* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
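
For context, the lowest-'us'-wins resolution described in the commit above amounts to keeping, per shape key, the single fastest row across all merged files. The sketch below is not the repo's update_config_files code; the shape-key columns and file paths are assumptions.

```python
# Sketch of a lowest-'us'-wins dedup across merged tuning CSVs. Not the actual
# update_config_files logic; key columns and paths are illustrative assumptions.
import csv
from collections import defaultdict

SHAPE_KEY = ("cu_num", "token", "model_dim", "inter_dim", "expert", "topk")  # assumed

def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

candidates = defaultdict(list)  # shape key -> list of (latency_us, source, row)
for path in [
    "aiter/configs/tuned_fmoe.csv",
    "aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv",
]:
    for row in load(path):
        key = tuple(row[c] for c in SHAPE_KEY)
        candidates[key].append((float(row["us"]), path, row))

# Keep only the fastest measurement per shape; losing rows would be dropped from
# their source files so the build's no-collision assertion passes.
for key, entries in sorted(candidates.items()):
    us, source, _row = min(entries, key=lambda e: e[0])
    print(f"{key}: keep row from {source} ({us} us)")
```
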
sunway513 added a commit that referenced this pull request May 5, 2026
Cherry-pick of 1638f9e from main onto release/v0.1.13.
Conflict in a8w8_blockscale_tuned_gemm_ds_v3.csv resolved by taking theirs.

Original PR: #3024
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
…iniMax-M2.5, DeepSeek-V3.2) (#3004)

Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)
