
[Silo] Bulk merge: tuned GEMM and FMoE configs (GLM-4.7, Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.2)#3004

Merged
sunway513 merged 13 commits into main from silo/v0.1.13-configs
May 2, 2026

Conversation

@sunway513
Collaborator

Summary

Bulk merge of 5 Silo config-only PRs for v0.1.13 deadline (vLLM 0.21 freeze 2026-05-08).

Pure CSV tuning configs — no code changes.

Included PRs

  • #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
  • #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
  • #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
  • #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
  • #2982 MiniMax-M2.5 FMoE tunings

Files Changed

All under aiter/configs/:

  • a8w8_blockscale_tuned_gemm.csv
  • tuned_fmoe.csv
  • model_configs/glm47_fp8_tuned_fmoe.csv (new)
  • model_configs/glm47_fp8_untuned_fmoe.csv (new)
  • model_configs/kimik2_fp4_tuned_fmoe.csv

Risk

Low — CSV config files only. No kernel code, no API changes, no build changes.

Test Plan

  • CI passes (no code changes, config load only)
  • Spot check: ATOM loads new configs without error
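
For illustration only, a config-load spot check along the lines of the test plan above could look like the following sketch. This is hypothetical tooling, not part of the PR; it only assumes the listed CSVs sit under aiter/configs/ in a checkout.

```python
# Hypothetical spot check (not part of this PR): confirm the new/updated CSVs
# parse cleanly and keep a consistent column count per file.
import csv
from pathlib import Path

CONFIG_DIR = Path("aiter/configs")  # assumed repo-relative location
FILES = [
    "a8w8_blockscale_tuned_gemm.csv",
    "tuned_fmoe.csv",
    "model_configs/glm47_fp8_tuned_fmoe.csv",
    "model_configs/glm47_fp8_untuned_fmoe.csv",
    "model_configs/kimik2_fp4_tuned_fmoe.csv",
]

for name in FILES:
    with (CONFIG_DIR / name).open(newline="") as f:
        rows = [r for r in csv.reader(f) if r]
    widths = {len(r) for r in rows}
    assert len(widths) == 1, f"{name}: inconsistent column count {widths}"
    assert len(rows) > 1, f"{name}: header only, no data rows"
    print(f"{name}: {len(rows) - 1} data rows, {widths.pop()} columns")
```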

omirosh and others added 10 commits April 27, 2026 12:16
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor
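
Not part of the commit: a minimal sketch of how a "best of splitK=0 vs splitK>0" pick per shape could be derived from a raw tuning sweep. The sweep file name and the M/N/K/us column names are assumptions, not the actual schema of a8w8_blockscale_tuned_gemm.csv.

```python
# Illustrative only: keep the fastest candidate per (M, N, K), whether it is a
# splitK=0 or splitK>0 config. Column names here are assumptions.
import csv

best = {}  # (M, N, K) -> fastest row seen so far
with open("raw_gemm_sweep.csv", newline="") as f:  # hypothetical tuner output
    for row in csv.DictReader(f):
        key = (int(row["M"]), int(row["N"]), int(row["K"]))
        if key not in best or float(row["us"]) < float(best[key]["us"]):
            best[key] = row

if best:
    fieldnames = list(next(iter(best.values())).keys())
    with open("a8w8_blockscale_tuned_gemm.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for key in sorted(best):
            writer.writerow(best[key])
```
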
@sunway513 sunway513 requested review from a team and Copilot May 1, 2026 23:30
@github-actions
Contributor

github-actions Bot commented May 1, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x: Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 3004 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

Bulk merge of multiple Silo “config-only” PRs, adding/updating tuned GEMM and fused-MoE (FMoE) CSV tables under aiter/configs/ (and aiter/configs/model_configs/) to improve kernel selection for specific models/hardware targets (e.g., MI355X / cu_num=256).

Changes:

  • Update model-specific tuned FMoE table for Kimi K2.5 FP4 (kimik2_fp4_tuned_fmoe.csv).
  • Add GLM-4.7 FP8 tuned + untuned FMoE shape tables (glm47_fp8_{tuned,untuned}_fmoe.csv).
  • (Per PR description) bulk-update shared tuned tables for GEMM/FMoE under aiter/configs/.

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 1 comment.

Summary per file:

  • aiter/configs/a8w8_blockscale_tuned_gemm.csv: (Per PR description) bulk tuning table updates for A8W8 blockscale GEMMs.
  • aiter/configs/tuned_fmoe.csv: (Per PR description) bulk tuning table updates for fused MoE kernel selection.
  • aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv: New GLM-4.7 FP8 tuned FMoE configs (cu_num=256).
  • aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv: New GLM-4.7 FP8 untuned reference shapes (input list for tuning).
  • aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: Updated Kimi K2.5 FP4 tuned FMoE configs (cu_num=256).
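
Aside (not from the Copilot review): since the untuned file is described as the input shape list for tuning, a rough coverage check could compare it against the tuned table. The sketch below assumes the untuned file's columns form a leading subset of the tuned file's columns, which may not match the real schema.

```python
# Assumption-based sketch: every shape row in the untuned GLM-4.7 list should
# have at least one matching entry in the tuned table.
import csv

def read_rows(path):
    with open(path, newline="") as f:
        return [r for r in csv.reader(f) if r]

untuned = read_rows("aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv")
tuned = read_rows("aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv")

n_key_cols = len(untuned[0])  # shape-key width taken from the untuned header
tuned_keys = {tuple(r[:n_key_cols]) for r in tuned[1:]}
missing = [r for r in untuned[1:] if tuple(r) not in tuned_keys]
print(f"{len(missing)} untuned shapes have no tuned entry")
```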


Comment on lines +2 to +14
256,1,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,24.9944,moe_ck2stages_gemm1_256x32x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,20.3787,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,45.3731,0,8.32,20799.41
256,2,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,40.11,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_32x128_3tg_pf3E,0.00%,28.5666,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,68.6766,0,10.99,13741.93
256,4,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,99.8134,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192E,0.00%,0,,0.00%,99.8134,1,15.13,9455.44
256,8,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,134.9101,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,134.9101,1,22.38,6996.08
256,16,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,156.6731,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,156.6731,1,38.55,6025.06
256,32,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,164.2209,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,164.2209,1,73.56,5749.63
256,64,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,168.1699,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,168.1699,1,143.66,5617.54
256,128,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,177.1016,_ZN5aiter48fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_ps_32x384E,0.00%,0,,0.00%,177.1016,1,272.83,5339.79
256,256,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,111.1445,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_64x128_2tg_pf3E,0.00%,94.3821,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,205.5266,0,470.19,4610.84
256,512,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,128.617,_ZN5aiter45fmoe_stage1_bf16_pertokenFp8_g1u1_128x128_pf3E,0.00%,136.7373,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,265.3543,0,728.36,3586.08
256,1024,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,227.447,moe_ck2stages_gemm1_256x64x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,192.093,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,419.54,0,921.36,2286.9
256,2048,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,352.8768,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,345.4655,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.70%,698.3423,0,1107.04,1396.42
256,4096,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,582.0723,moe_ck2stages_gemm1_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,641.9346,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,1224.0069,0,1263.22,822.41
@sunway513 sunway513 merged commit 52c4554 into main May 2, 2026
34 of 35 checks passed
@sunway513 sunway513 deleted the silo/v0.1.13-configs branch May 2, 2026 03:16
chun-wan pushed a commit that referenced this pull request May 4, 2026
…iniMax-M2.5, DeepSeek-V3.2) (#3004)

* Add GLM-4.7 FP8 tuned and untuned FMOE configs

* Added MI355X MoE tunings for Kimi-K2 FP4 TP2

* Add MiniMax M25 A8W8 blockscale GEMM tunings

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

* Add MiniMax M25 FMoE tunings

* fix: dedup 1692 duplicate entries in tuned_fmoe.csv from merge

* fix: remove 446 shapes from main CSV that duplicate ds_v3 model config

* fix: remove FMoE shapes that duplicate model-specific configs

---------

Co-authored-by: Olga Miroshnichenko <olga.miroshnichenko@amd.com>
Co-authored-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
@akii96
Contributor

akii96 commented May 4, 2026

Follow-up note for #3004: please see my comment on #2982 here: #2982 (comment)

The #2982 branch includes CK two-stage instance additions that were not included in this config-only bulk merge.

I was not sure if a separate PR was needed just for a few additions, so I included them in my tunings PR. Apologies, I understand now it was intended to be a pure config-only PR; let me know if you need a separate PR for the instances.

azaidy pushed a commit that referenced this pull request May 4, 2026
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)
sunway513 pushed a commit that referenced this pull request May 5, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:

- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)

Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.

* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
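
For context, the lowest-'us'-wins resolution described in the commit above amounts to keeping, per shape key, the single fastest row across all merged files. The sketch below is not the repo's update_config_files code; the shape-key columns and file paths are assumptions.

```python
# Sketch of a lowest-'us'-wins dedup across merged tuning CSVs. Not the actual
# update_config_files logic; key columns and paths are illustrative assumptions.
import csv
from collections import defaultdict

SHAPE_KEY = ("cu_num", "token", "model_dim", "inter_dim", "expert", "topk")  # assumed

def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

candidates = defaultdict(list)  # shape key -> list of (latency_us, source, row)
for path in [
    "aiter/configs/tuned_fmoe.csv",
    "aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv",
]:
    for row in load(path):
        key = tuple(row[c] for c in SHAPE_KEY)
        candidates[key].append((float(row["us"]), path, row))

# Keep only the fastest measurement per shape; losing rows would be dropped from
# their source files so the build's no-collision assertion passes.
for key, entries in sorted(candidates.items()):
    us, source, _row = min(entries, key=lambda e: e[0])
    print(f"{key}: keep row from {source} ({us} us)")
```
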
sunway513 added a commit that referenced this pull request May 5, 2026
Cherry-pick of 1638f9e from main onto release/v0.1.13.
Conflict in a8w8_blockscale_tuned_gemm_ds_v3.csv resolved by taking theirs.

Original PR: #3024
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
…iniMax-M2.5, DeepSeek-V3.2) (#3004)

Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)
