* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)
* Add MiniMax M25 FMoE codegen instances
* Add MiniMax M25 FMoE cleaned entries
* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.
GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0; a selection sketch follows this commit note)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%
FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
(N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged
Made-with: Cursor
(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
GEMM: +6375 / -5
FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)
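The split-K selection above amounts to a per-shape argmin over measured latencies. A minimal sketch, assuming a `candidates` map keyed by (M, N, K) holding (kernel_name, splitK, measured_us) tuples; the field names, helpers, and output columns are illustrative, not the actual tuner schema or CSV layout:
```python
# Hypothetical sketch: keep the fastest split-K variant per GEMM shape.
# `candidates`: {(M, N, K): [(kernel_name, splitK, measured_us), ...]}
import csv

def best_splitk_rows(candidates, cu_num=256, gfx="gfx950"):
    rows = []
    for (m, n, k), results in candidates.items():
        # Best-of selection: splitK=0 and splitK>0 variants compete on latency.
        kernel, splitk, us = min(results, key=lambda r: r[2])
        rows.append({"M": m, "N": n, "K": k, "cu_num": cu_num, "gfx": gfx,
                     "kernel": kernel, "splitK": splitk, "us": us})
    return rows

def write_tuned_csv(path, rows):
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        w.writeheader()
        w.writerows(rows)
```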
* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3
#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:
- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)
Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement (a sketch of this pass follows below).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
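A minimal sketch of that lowest-'us'-wins pass, assuming both CSVs expose M, N, K, cu_num, gfx, and us columns (the shape key and the 'us' column are named above; everything else, including the helper name, is illustrative and not code from the aiter build):
```python
# Sketch: resolve (M, N, K, cu_num, gfx) collisions between two tuned-GEMM CSVs
# by keeping, for each shape key, only the row with the lowest measured 'us'.
import pandas as pd

SHAPE_KEY = ["M", "N", "K", "cu_num", "gfx"]

def dedup_lowest_us(global_csv, model_csv):
    a = pd.read_csv(global_csv).assign(_src="global")
    b = pd.read_csv(model_csv).assign(_src="model")
    merged = pd.concat([a, b], ignore_index=True)
    # idxmin picks exactly one winner per shape group
    # (rows unique to one file win trivially).
    winners = merged.loc[merged.groupby(SHAPE_KEY)["us"].idxmin()]
    keep_global = winners[winners["_src"] == "global"].drop(columns="_src")
    keep_model = winners[winners["_src"] == "model"].drop(columns="_src")
    return keep_global, keep_model   # write each back to its original file
```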
* [configs] dedup colliding FMoE shapes across global + model_configs files
Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.
Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings: 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)
Every shape contributed by #2981 and #2982 remains covered post-dedup;
where their row was not the winner, the existing model-specific tuning
wins on its own merits. A collision-check sketch follows this note.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
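For reference, a sketch of the kind of cross-file check the build performs; this is not the real update_config_files code. It assumes every tuned_fmoe file carries all untuned_fmoe.csv columns plus cu_num and _tag, per the key described above:
```python
# Illustrative collision check: no shape key may appear in more than one
# tuned_fmoe file. Key = untuned_fmoe.csv header columns + cu_num + _tag.
from collections import defaultdict
import pandas as pd

def assert_no_fmoe_collisions(untuned_csv, tuned_csvs):
    shape_cols = list(pd.read_csv(untuned_csv, nrows=0).columns) + ["cu_num", "_tag"]
    seen = defaultdict(set)          # shape key -> set of files containing it
    for path in tuned_csvs:
        df = pd.read_csv(path)
        for key in df[shape_cols].itertuples(index=False, name=None):
            seen[key].add(path)
    collisions = {k: files for k, files in seen.items() if len(files) > 1}
    assert not collisions, f"{len(collisions)} shape(s) tuned in more than one file"
```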
* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.
* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.
* [configs] Move GLM FMoE rows back to GLM config
Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.
* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.
* [moe] Disable risky/unused blockscale MoE instances
---------
Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Restores configs that were silently dropped from the bulk merge in #3004. The dedup commits in that PR overshot — #2979 and #2982 lost 100% of their content, and #2981 lost 223 GEMM rows + 57 FMoE rows.
After this PR merges, main will reflect every source PR originally listed in #3004.
Source PR coverage (after this PR merges)
Covered files (a8w8_blockscale_tuned_gemm.csv and tuned_fmoe.csv are each touched by two source PRs):
- model_configs/glm47_fp8_tuned_fmoe.csv
- model_configs/glm47_fp8_untuned_fmoe.csv
- model_configs/kimik2_fp4_tuned_fmoe.csv
- a8w8_blockscale_tuned_gemm.csv
- tuned_fmoe.csv
- csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py
Counts are computed as semantic set deltas (PR base set vs PR HEAD set), so reorderings don't inflate them.
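As a concrete reading of "semantic set delta" (a sketch, not text or tooling from this PR; the helper name is illustrative): each CSV version is reduced to a set of data rows, so pure reorderings add and remove nothing.
```python
# Sketch: count row-set adds/removes between two versions of a config CSV.
def semantic_set_delta(base_path, head_path):
    def data_rows(path):
        with open(path) as f:
            _header, *rows = f.read().splitlines()
        return set(rows)
    base, head = data_rows(base_path), data_rows(head_path)
    return len(head - base), len(base - head)   # (+added, -removed)
```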
What this PR adds (vs current main)
- aiter/configs/a8w8_blockscale_tuned_gemm.csv: +6853 / -0
- aiter/configs/tuned_fmoe.csv: +133 / -0
- csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py: +6 / -2
Commits
Original authorship preserved via cherry-pick:
- 96e0fb7b (Aakif Nawaz): MiniMax M25 A8W8 blockscale GEMM tunings (#2979)
- dca7c8ac (Aakif Nawaz): codegen instances (#2982)
- a08d8b38 (Aakif Nawaz): MiniMax M25 FMoE entries (#2982)
- fcfa67d5 (frida-andersson): [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2 (#2981), 223 GEMM + 57 FMoE net-new vs main
Duplicate audit
Files audited: a8w8_blockscale_tuned_gemm.csv, tuned_fmoe.csv.
@sunway513's dedup intent from #3004 is preserved; no conflicting shapes are added. The 432 cu_num=80 shape duplicates that pre-exist in tuned_fmoe.csv on main are out of scope for this PR (leftover from the earlier dedup pass).
Risk
Low — CSV configs and a small codegen change (already used elsewhere in #2982). No kernel code, no API changes.
Test plan