Conversation
Force-pushed from 1979e04 to 374f437.
This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded. Tracking issue: ROCm/AI-Frameworks-Dashboard#141
Squash-merged from main commit 52c4554. Includes 5 atomic Silo PRs:

- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: two blocks resolved by taking theirs (Silo). Block 1 upgrades existing M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely additive: 30+ new rows for previously uncovered N=7168/K=1024 shapes plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08; Silo customers need these tunings on the AITER release wheel, not nightly.

Verification gate before tag:

- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference, confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1 (a minimal tolerance-check sketch follows this comment)
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
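For the accuracy item in the gate above, a minimal sketch of what a "+/- 0.005 vs v0.1.13-rc1" check could look like, assuming per-model accuracy scores have already been collected into plain dicts; the model names and numbers below are placeholders, not measured results.

```python
# Minimal sketch of the "accuracy unchanged within +/- 0.005" gate.
# ASSUMPTION: scores are already available as {model_name: accuracy} dicts;
# the model list and values below are placeholders, not real measurements.

TOLERANCE = 0.005

def accuracy_gate(baseline: dict[str, float],
                  candidate: dict[str, float],
                  tol: float = TOLERANCE) -> list[str]:
    """Return the models whose accuracy moved by more than `tol`."""
    failures = []
    for model, base_acc in baseline.items():
        new_acc = candidate.get(model)
        if new_acc is None or abs(new_acc - base_acc) > tol:
            failures.append(model)
    return failures

if __name__ == "__main__":
    baseline = {"model_a": 0.812, "model_b": 0.774}   # e.g. v0.1.13-rc1 numbers
    candidate = {"model_a": 0.813, "model_b": 0.771}  # release-candidate numbers
    bad = accuracy_gate(baseline, candidate)
    print("PASS" if not bad else f"FAIL: {bad}")
```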
Hi @sunway513, one clarification before closing: this PR is not fully covered by the #3004 config-only bulk merge. In aiter/configs/tuned_fmoe.csv, the new MiniMax FMoE rows for token=4096 and token=8192 with model_dim=3072, inter_dim=384 reference the added 256x64x128x128 ... A8W8blkscale_v1 CK two-stage instances from csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py. Those CSV rows therefore depend on the instance additions, which were left out of #3004.
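To make that dependency concrete, here is a rough sketch of how one could check that every kernel instance referenced by a tuned CSV is actually generated. The column name `kernel_name` and the idea of collecting generated instances into a set of strings are assumptions for illustration, not the actual aiter schema.

```python
# Rough sketch: confirm every kernel instance referenced by tuned CSV rows
# exists among the generated CK two-stage instances.
# ASSUMPTIONS: the tuned CSV has a column holding the instance name (called
# "kernel_name" here) and the codegen output can be summarized as a set of
# instance-name strings; neither is taken from the real aiter layout.
import csv

def referenced_instances(tuned_csv: str, column: str = "kernel_name") -> set[str]:
    with open(tuned_csv, newline="") as f:
        return {row[column] for row in csv.DictReader(f) if row.get(column)}

def missing_instances(tuned_csv: str, generated: set[str]) -> set[str]:
    """Return instance names the CSV references but the codegen never emits."""
    return referenced_instances(tuned_csv) - generated

# Example: rows naming the new 256x64x128x128 A8W8blkscale_v1 instances would
# show up as missing if the codegen additions were dropped from a merge.
```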
[configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE. The build's update_config_files asserts no shape collisions across merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag); the additions in this PR introduced 65 cross-file collisions between tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):

- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better existing tunings, mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup; where their row was not the winner, the existing model-specific tuning wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
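For readers unfamiliar with the resolution, a minimal sketch of lowest-'us'-wins dedup across several tuned CSVs follows, assuming each file shares a 'us' latency column and that the shape key is a fixed list of columns; the key column names below are placeholders, not the real untuned_fmoe.csv header.

```python
# Minimal sketch of lowest-'us'-wins dedup across several tuned config CSVs.
# ASSUMPTIONS: every file has a 'us' latency column and the shape key is the
# fixed column list below (placeholder names); the real key is the
# untuned_fmoe.csv columns plus cu_num and _tag.
import csv

KEY_COLS = ["token", "model_dim", "inter_dim", "expert", "topk", "cu_num", "_tag"]

def dedup_lowest_us(csv_paths: list[str]) -> dict[tuple, tuple[str, dict]]:
    """Map shape key -> (source file, fastest row) across all input files."""
    best: dict[tuple, tuple[str, dict]] = {}
    for path in csv_paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                key = tuple(row[c] for c in KEY_COLS)
                if key not in best or float(row["us"]) < float(best[key][1]["us"]):
                    best[key] = (path, row)
    return best

# Rows whose shape key is won by another file are exactly the "superseded"
# counts listed per file in the commit message above.
```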
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

  Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

  GEMM (a8w8_blockscale_tuned_gemm.csv):
  - 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
  - Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
  - Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

  FMoE (tuned_fmoe.csv):
  - 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
  - Replaces 751 previous cu_num=256 entries with re-tuned results
  - Existing cu_num=80 (MI300X) entries unchanged

  Made-with: Cursor

  (Cherry-picked from #2981 to restore content lost in bulk merge #3004. Net semantic effect of this PR vs current main: GEMM +6375 / -5, FMoE +57 / -7. The remaining 1042 of #2981's 1099 textual FMoE adds are content-identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

  #2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv that share (M,N,K,cu_num,gfx) shape keys with the model-specific a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape collisions across merged config files; resolve by keeping the lowest 'us' row per shape:
  - 187 ds_v3 rows dropped (superseded by #2981's better tunings)
  - 36 #2981 rows dropped (ds_v3's existing tunings were faster)

  Result: every conflicting shape still has a tuning, picked from whichever file had the better measurement.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

  Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE. The build's update_config_files asserts no shape collisions across merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag); the additions in this PR introduced 65 cross-file collisions between tuned_fmoe.csv and 4 model_configs files.

  Resolution (best 'us' per shape):
  - tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better existing tunings, mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
  - a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
  - a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
  - a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
  - glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

  Every shape contributed by #2981 and #2982 remains covered post-dedup; where their row was not the winner, the existing model-specific tuning wins on its own merits.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config

  Keep the surviving MiniMax M25 FMoE rows in the model-specific config file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs

  Keep the surviving DeepSeek V3.2 tuning rows in model-specific config files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

  Address likely MiniMax tuning contamination by moving the per-token GLM-4.7 FMoE entries back to the GLM model config where they belong.
* [moe] Disable blockscale GEMM2 instance that exceeds LDS

  Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2 candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
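As a purely illustrative back-of-envelope for why a large block tile can blow the LDS budget: the reading of 256x128x128x128 as M/N/K block sizes, the 1-byte fp8 operand size, double-buffered staging, and the 64 KiB per-workgroup LDS figure below are all assumptions, not taken from the CK instance definition.

```python
# Purely illustrative LDS budget estimate; every number below is an assumption,
# not read from the actual kernel instance.
M_PER_BLOCK, N_PER_BLOCK, K_PER_BLOCK = 256, 128, 128  # assumed tile interpretation
BYTES_PER_ELEM = 1        # fp8 / int8 operands
STAGES = 2                # assumed double-buffered LDS staging
LDS_CAPACITY = 64 * 1024  # typical per-workgroup LDS on CDNA parts

a_tile = M_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM
b_tile = N_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM
lds_needed = STAGES * (a_tile + b_tile)

print(f"estimated LDS: {lds_needed // 1024} KiB vs {LDS_CAPACITY // 1024} KiB")
# -> 96 KiB vs 64 KiB under these assumptions, i.e. such a candidate cannot be
#    prebuilt and is better disabled up front than left to fail at JIT time.
```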
Merged with #3024.
Adds MiniMax M25 FMoE tuning entries and keeps the tuning table deduplicated and sorted by token