* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)
* Add MiniMax M25 FMoE codegen instances
* Add MiniMax M25 FMoE cleaned entries
* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.
GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0; a selection sketch follows this commit note)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%
FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
(N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged
Made-with: Cursor
(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
GEMM: +6375 / -5
FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)
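The split-K selection above amounts to a per-shape argmin over measured latencies. A minimal sketch, assuming a `candidates` map keyed by (M, N, K) holding (kernel_name, splitK, measured_us) tuples; the field names, helpers, and output columns are illustrative, not the actual tuner schema or CSV layout:
```python
# Hypothetical sketch: keep the fastest split-K variant per GEMM shape.
# `candidates`: {(M, N, K): [(kernel_name, splitK, measured_us), ...]}
import csv

def best_splitk_rows(candidates, cu_num=256, gfx="gfx950"):
    rows = []
    for (m, n, k), results in candidates.items():
        # Best-of selection: splitK=0 and splitK>0 variants compete on latency.
        kernel, splitk, us = min(results, key=lambda r: r[2])
        rows.append({"M": m, "N": n, "K": k, "cu_num": cu_num, "gfx": gfx,
                     "kernel": kernel, "splitK": splitk, "us": us})
    return rows

def write_tuned_csv(path, rows):
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)
```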
* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3
#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:
- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)
Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement (a sketch of this pass follows below).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
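A minimal sketch of that lowest-'us'-wins pass, assuming both CSVs expose M, N, K, cu_num, gfx, and us columns (the shape key and the 'us' column are named above; everything else, including the helper name, is illustrative and not code from the aiter build):
```python
# Sketch: resolve (M, N, K, cu_num, gfx) collisions between two tuned-GEMM CSVs
# by keeping, for each shape key, only the row with the lowest measured 'us'.
import pandas as pd

SHAPE_KEY = ["M", "N", "K", "cu_num", "gfx"]

def dedup_lowest_us(global_csv, model_csv):
    a = pd.read_csv(global_csv).assign(_src="global")
    b = pd.read_csv(model_csv).assign(_src="model")
    merged = pd.concat([a, b], ignore_index=True)
    # idxmin picks exactly one winner per shape group
    # (rows unique to one file win trivially).
    winners = merged.loc[merged.groupby(SHAPE_KEY)["us"].idxmin()]
    keep_global = winners[winners["_src"] == "global"].drop(columns="_src")
    keep_model = winners[winners["_src"] == "model"].drop(columns="_src")
    return keep_global, keep_model   # write each back to its original file
```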
* [configs] dedup colliding FMoE shapes across global + model_configs files
Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.
Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings: 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)
Every shape contributed by #2981 and #2982 remains covered post-dedup;
where their row was not the winner, the existing model-specific tuning
wins on its own merits. A collision-check sketch follows this note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
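For reference, a sketch of the kind of cross-file check the build performs; this is not the real update_config_files code. It assumes every tuned_fmoe file carries all untuned_fmoe.csv columns plus cu_num and _tag, per the key described above:
```python
# Illustrative collision check: no shape key may appear in more than one
# tuned_fmoe file. Key = untuned_fmoe.csv header columns + cu_num + _tag.
from collections import defaultdict
import pandas as pd

def assert_no_fmoe_collisions(untuned_csv, tuned_csvs):
    shape_cols = list(pd.read_csv(untuned_csv, nrows=0).columns) + ["cu_num", "_tag"]
    seen = defaultdict(set)          # shape key -> set of files containing it
    for path in tuned_csvs:
        df = pd.read_csv(path)
        for key in df[shape_cols].itertuples(index=False, name=None):
            seen[key].add(path)
    collisions = {k: files for k, files in seen.items() if len(files) > 1}
    assert not collisions, f"{len(collisions)} shape(s) tuned in more than one file"
```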
* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.
* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.
* [configs] Move GLM FMoE rows back to GLM config
Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.
* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.
* [moe] Disable risky/unused blockscale MoE instances
---------
Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Restores configs that were silently dropped from the bulk merge in #3004. The dedup commits in that PR overshot — #2979 and #2982 lost 100% of their content, and #2981 lost 223 GEMM rows + 57 FMoE rows.
After this PR merges, main will reflect every source PR originally listed in #3004.
Source PR coverage (after this PR merges)
Covered files (a8w8_blockscale_tuned_gemm.csv and tuned_fmoe.csv are each touched by two source PRs):
- model_configs/glm47_fp8_tuned_fmoe.csv
- model_configs/glm47_fp8_untuned_fmoe.csv
- model_configs/kimik2_fp4_tuned_fmoe.csv
- a8w8_blockscale_tuned_gemm.csv
- tuned_fmoe.csv
- csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py
Counts are computed as semantic set deltas (PR base set vs PR HEAD set), so reorderings don't inflate them.
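As a concrete reading of "semantic set delta" (a sketch, not text or tooling from this PR; the helper name is illustrative): each CSV version is reduced to a set of data rows, so pure reorderings add and remove nothing.
```python
# Sketch: count row-set adds/removes between two versions of a config CSV.
def semantic_set_delta(base_path, head_path):
    def data_rows(path):
        with open(path) as f:
            _header, *rows = f.read().splitlines()
        return set(rows)
    base, head = data_rows(base_path), data_rows(head_path)
    return len(head - base), len(base - head)   # (+added, -removed)
```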
What this PR adds (vs current main)
- aiter/configs/a8w8_blockscale_tuned_gemm.csv: +6853 / -0
- aiter/configs/tuned_fmoe.csv: +133 / -0
- csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py: +6 / -2
Commits
Original authorship preserved via cherry-pick:
- 96e0fb7b (Aakif Nawaz): MiniMax M25 A8W8 blockscale GEMM tunings (#2979)
- dca7c8ac (Aakif Nawaz): codegen instances (#2982)
- a08d8b38 (Aakif Nawaz): MiniMax M25 FMoE entries (#2982)
- fcfa67d5 (frida-andersson): [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2 (#2981), 223 GEMM + 57 FMoE net-new vs main
Duplicate audit
Files audited: a8w8_blockscale_tuned_gemm.csv, tuned_fmoe.csv.
@sunway513's dedup intent from #3004 is preserved; no conflicting shapes are added. The 432 cu_num=80 shape duplicates that pre-exist in tuned_fmoe.csv on main are out of scope for this PR (leftover from the earlier dedup pass).
Risk
Low — CSV configs and a small codegen change (already used elsewhere in #2982). No kernel code, no API changes.
Test plan