
[Silo] Add configs missing from bulk merge #3004 #3024

Merged
sunway513 merged 11 commits into main from fix/restore-minimax-m25-tunings on May 5, 2026

Conversation

@azaidy
Contributor

@azaidy azaidy commented May 4, 2026

Summary

Restores configs that were silently dropped from the bulk merge in #3004. The dedup commits in that PR overshot — #2979 and #2982 lost 100% of their content, and #2981 lost 223 GEMM rows + 57 FMoE rows.

After this PR merges, main will reflect every source PR originally listed in #3004.

Source PR coverage (after this PR merges)

| Source PR | File | Semantic adds | Semantic removes | Where it lands |
| --- | --- | --- | --- | --- |
| #2923 GLM-4.7 | model_configs/glm47_fp8_tuned_fmoe.csv | 16/16 ✓ | 0/0 ✓ | already on main from #3004 |
| #2923 GLM-4.7 | model_configs/glm47_fp8_untuned_fmoe.csv | 16/16 ✓ | 0/0 ✓ | already on main from #3004 |
| #2938 Kimi K2.5 | model_configs/kimik2_fp4_tuned_fmoe.csv | 160/160 ✓ | 128/128 ✓ | already on main from #3004 |
| #2979 MiniMax M2.5 GEMM | a8w8_blockscale_tuned_gemm.csv | 6630/6630 ✓ | 0/0 ✓ | this PR (was 0% on main) |
| #2981 DeepSeek-V3.2 | a8w8_blockscale_tuned_gemm.csv | 6375/6375 ✓ | 5/5 ✓ | 6152 GEMM rows already on main + 223 net-new in this PR |
| #2981 DeepSeek-V3.2 | tuned_fmoe.csv | 57/57 ✓ | 7/7 ✓ | 1042 reorderings already on main + 57 net-new in this PR |
| #2982 MiniMax M2.5 FMoE | tuned_fmoe.csv | 76/76 ✓ | 0/0 ✓ | this PR (was 0% on main) |
| #2982 MiniMax M2.5 FMoE | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py | 6/6 ✓ | 2/2 ✓ | this PR (was 0% on main) |

Counts are computed as semantic set deltas (PR base set vs PR HEAD set), so reorderings don't inflate them.
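For illustration, a minimal sketch of that semantic-delta computation (assumptions: Python's csv module and hypothetical base/head file paths; this is not the actual audit script):

```python
import csv

def row_set(path: str) -> set:
    """Load a tuning CSV as a set of row tuples, so row ordering doesn't matter."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip header
        return {tuple(row) for row in reader}

def semantic_delta(base_csv: str, head_csv: str) -> tuple:
    """Return (semantic adds, semantic removes) between PR base and PR HEAD."""
    base, head = row_set(base_csv), row_set(head_csv)
    return len(head - base), len(base - head)

# Hypothetical paths for illustration only.
adds, removes = semantic_delta("base/tuned_fmoe.csv", "head/tuned_fmoe.csv")
print(f"+{adds} / -{removes}")
```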

What this PR adds (vs current main)

  • aiter/configs/a8w8_blockscale_tuned_gemm.csv — +6853 / -0
  • aiter/configs/tuned_fmoe.csv — +133 / -0
  • csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py — +6 / -2

Commits

Original authorship preserved via cherry-pick:

Duplicate audit

| File | Rows | Exact-row dup | Shape dup |
| --- | --- | --- | --- |
| a8w8_blockscale_tuned_gemm.csv | 13005 | 0 | 0 |
| tuned_fmoe.csv | 1080 | 0 | 0 |

@sunway513's dedup intent from #3004 is preserved — no conflicting shapes are added. The 432 cu_num=80 shape duplicates that pre-exist in tuned_fmoe.csv on main are out of scope for this PR (leftover from the earlier dedup pass).
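A sketch of the duplicate audit above (assumption: the GEMM shape key is (M,N,K,cu_num,gfx); the authoritative keys live in the aiter build scripts, and the FMoE key uses the untuned_fmoe.csv columns plus cu_num and _tag instead):

```python
import csv
from collections import Counter

SHAPE_COLS = ("M", "N", "K", "cu_num", "gfx")  # assumed GEMM shape key

def audit(path: str) -> tuple:
    """Return (total rows, exact-row duplicates, shape-key duplicates)."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    exact = Counter(tuple(sorted(r.items())) for r in rows)
    shapes = Counter(tuple(r.get(c, "") for c in SHAPE_COLS) for r in rows)
    exact_dup = sum(n - 1 for n in exact.values() if n > 1)
    shape_dup = sum(n - 1 for n in shapes.values() if n > 1)
    return len(rows), exact_dup, shape_dup

print(audit("aiter/configs/a8w8_blockscale_tuned_gemm.csv"))
```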

Risk

Low — CSV configs and a small codegen change (already used elsewhere in #2982). No kernel code, no API changes.

Test plan

  • CI passes
  • Spot check: ATOM loads new configs without error for MiniMax M2.5 and DeepSeek-V3.2

@azaidy azaidy requested review from a team, akii96 and sunway513 May 4, 2026 20:28
@github-actions
Contributor

github-actions Bot commented May 4, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 3024 --add-label <label>

@azaidy azaidy closed this May 4, 2026
@azaidy azaidy reopened this May 4, 2026

@azaidy azaidy requested a review from valarLip May 4, 2026 20:56
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)
@azaidy azaidy force-pushed the fix/restore-minimax-m25-tunings branch from 065bef4 to fcfa67d May 4, 2026 21:37
[configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:

- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)

Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@azaidy
Contributor Author

azaidy commented May 4, 2026

Cross-file shape collision dedup applied (both GEMM and FMoE) — updated

Update: Originally I posted that I'd applied the lowest-us-wins dedup to resolve cross-file collisions. Since then, @frida-andersson (b722ba78) and @akii96 (89e9e860) pushed commits that move the surviving DSv3.2 and MiniMax M2.5 tunings out of the global a8w8_blockscale_tuned_gemm.csv / tuned_fmoe.csv and into their proper model_configs/ files. That cleanup is the right place for them to live and naturally resolves the collisions on top of the dedup.

Final state

| Source PR | Now lives in | Coverage |
| --- | --- | --- |
| #2923 GLM-4.7 | model_configs/glm47_fp8_tuned_fmoe.csv | 16/16 ✓ |
| #2938 Kimi K2.5 | model_configs/kimik2_fp4_tuned_fmoe.csv | 32/32 ✓ |
| #2979 MiniMax M2.5 GEMM | global a8w8_blockscale_tuned_gemm.csv | 6630/6630 ✓ |
| #2981 DSv3.2 GEMM | model_configs/a8w8_blockscale_tuned_gemm_ds_v3.csv (relocated by b722ba78) | 6370/6370 ✓ |
| #2981 DSv3.2 FMoE | model_configs/a8w8_blockscale_tuned_fmoe_ds_v3.csv (relocated by b722ba78) | 51/51 ✓ |
| #2982 MiniMax M2.5 FMoE | model_configs/a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv (relocated by 89e9e860) | 76/76 ✓ |
| #2982 codegen | csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py | 6/6 adds + 2/2 removes ✓ |

Build-merge state

| Merge group | Files | Total rows | Shape collisions |
| --- | --- | --- | --- |
| a8w8_blockscale_tuned_gemm | global + 3 model files | 14,129 | 0 |
| tuned_fmoe | global + 13 model files | | 0 |

Commit history (after relocations)

| Commit | Author | Purpose |
| --- | --- | --- |
| 96e0fb7b | Aakif Nawaz | Restore #2979 MiniMax M2.5 GEMM tunings |
| dca7c8ac | Aakif Nawaz | Restore #2982 codegen change |
| a08d8b38 | Aakif Nawaz | Restore #2982 MiniMax M2.5 FMoE entries |
| fcfa67d5 | frida-andersson | Restore #2981 DSv3.2 net-new (223 GEMM + 57 FMoE) |
| f52db878 | aliasger | GEMM cross-file dedup (#2981 vs ds_v3, lowest-us wins) |
| 66a88d87 | aliasger | FMoE cross-file dedup (5 files, lowest-us wins) |
| 89e9e860 | Aakif Nawaz | Move surviving MiniMax M2.5 FMoE entries to model-specific config |
| b722ba78 | frida-andersson | Move surviving DSv3.2 GEMM/FMoE entries to model-specific configs |

Where #2981/#2982 tunings competed with existing model-specific tunings (223 GEMM + 65 FMoE shapes), lowest-us wins. Every shape any source PR intended to tune still has a tuning in the merged build state — either from the source PR's row or a faster pre-existing one.
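For context, a minimal sketch of a lowest-us-wins resolution (assumptions: pandas, the simplified (M,N,K,cu_num,gfx) key, and a `us` latency column; the real logic lives in update_config_files):

```python
import pandas as pd

# Assumed shape-key columns; the build scripts define the authoritative key.
KEY = ["M", "N", "K", "cu_num", "gfx"]

def dedup_lowest_us(global_csv: str, model_csv: str):
    """For every shape key present in both files, keep only the row with the
    lower 'us' measurement; the losing file drops its row for that shape."""
    g = pd.read_csv(global_csv)
    m = pd.read_csv(model_csv)
    merged = pd.concat([g.assign(_src="global"), m.assign(_src="model")])
    # Within each shape key, rank by measured microseconds and keep the fastest.
    winners = merged.sort_values("us").drop_duplicates(subset=KEY, keep="first")
    new_g = winners[winners["_src"] == "global"].drop(columns="_src")
    new_m = winners[winners["_src"] == "model"].drop(columns="_src")
    # Mirrors the build-time assertion: no shape collisions remain post-dedup.
    assert not pd.concat([new_g, new_m]).duplicated(subset=KEY).any()
    return new_g, new_m
```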

@valarLip — does this match the dedup intent of update_config_files? It's purely mechanical resolution of CI's auto-fix output, plus the relocations the original PR authors made on top. cc @frida-andersson @akii96.

azaidy and others added 3 commits May 4, 2026 23:18
[configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.

[configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.
Comment thread aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv
valarLip previously approved these changes May 5, 2026
Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.
akii96 added 2 commits May 5, 2026 07:04
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.
@azaidy azaidy changed the title from "Add configs missing from bulk merge #3004" to "[Silo] Add configs missing from bulk merge #3004" May 5, 2026
@sunway513 sunway513 merged commit 1638f9e into main May 5, 2026
44 of 45 checks passed
@sunway513 sunway513 deleted the fix/restore-minimax-m25-tunings branch May 5, 2026 18:58
sunway513 added a commit that referenced this pull request May 5, 2026
Cherry-pick of 1638f9e from main onto release/v0.1.13.
Conflict in a8w8_blockscale_tuned_gemm_ds_v3.csv resolved by taking theirs.

Original PR: #3024
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:

- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)

Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.

* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
