Add MiniMax M25 FMoE tunings #2982

Closed

akii96 wants to merge 2 commits into main from add-minimax-m25-fmoe-tunings

Conversation

@akii96
Contributor

@akii96 akii96 commented Apr 30, 2026

Adds MiniMax M25 FMoE tuning entries and keeps the tuning table deduplicated and sorted by token.
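
A minimal sketch of what "deduplicated and sorted by token" could look like for a tuned FMoE table, assuming pandas and assumed column names (token, us, and the shape-key columns); the real columns are whatever aiter/configs/tuned_fmoe.csv actually uses:

```python
# Hypothetical sketch, not code from this PR: keep one row per shape key,
# then sort the table by the 'token' column.
import pandas as pd

def dedup_and_sort(path, key_cols):
    df = pd.read_csv(path)
    # When two rows share a shape key, keep the one with the lowest
    # measured latency ('us' = microseconds).
    df = df.sort_values("us").drop_duplicates(subset=key_cols, keep="first")
    return df.sort_values("token").reset_index(drop=True)

# Example usage (column names assumed):
# table = dedup_and_sort("aiter/configs/tuned_fmoe.csv",
#                        ["token", "model_dim", "inter_dim", "cu_num"])
# table.to_csv("aiter/configs/tuned_fmoe.csv", index=False)
```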

@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests
ci:atom          ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm          vLLM benchmark
ci:all           All of the above

Add labels via the sidebar or gh pr edit 2982 --add-label <label>

@akii96 akii96 force-pushed the add-minimax-m25-fmoe-tunings branch from 1979e04 to 374f437 Compare May 4, 2026 13:40
@akii96 akii96 marked this pull request as ready for review May 4, 2026 13:41
@akii96 akii96 requested review from a team and amd-yashagar May 4, 2026 13:41
@akii96
Contributor Author

akii96 commented May 4, 2026

Latest updates:

  • added extra instances based on a previous commit by @amd-yashagar
  • added the cleaned-up FMoE tunings rather than a bulk dump of thousands of entries

@sunway513
Collaborator

This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded.

Tracking issue: ROCm/AI-Frameworks-Dashboard#141

sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
@akii96
Contributor Author

akii96 commented May 4, 2026

Hi @sunway513

One clarification before closing: this PR is not fully covered by the #3004 config-only bulk merge. In aiter/configs/tuned_fmoe.csv, the new MiniMax FMoE rows for token=4096 and token=8192 with model_dim=3072, inter_dim=384 reference the added 256x64x128x128 ... A8W8blkscale_v1 CK two-stage instances from csrc/ck_gemm_moe_2stages_codegen/gemm_moe_ck2stages_common.py. So those CSV rows do depend on the instance additions, which were left out of #3004.
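
A rough sketch of the dependency being described; the kernel_name column and the set of generated instance names are assumptions for illustration, not the real aiter layout:

```python
# Hypothetical check: every CK instance name referenced by a tuned_fmoe.csv row
# must actually be generated by gemm_moe_ck2stages_common.py, otherwise the
# tuning row points at a kernel that does not exist.
import csv

def missing_instances(csv_path, generated_instances):
    referenced = set()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            name = row.get("kernel_name", "").strip()  # assumed column name
            if name:
                referenced.add(name)
    return referenced - set(generated_instances)

# e.g. the new 256x64x128x128 A8W8blkscale_v1 instances would be reported here
# if only the CSV rows, and not the codegen additions, were merged.
```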

azaidy added a commit that referenced this pull request May 4, 2026
[configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
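
A minimal sketch of the lowest-'us'-wins resolution described in this commit, assuming pandas and assumed column names; key_cols stands in for the untuned_fmoe.csv columns plus cu_num and _tag:

```python
# Hypothetical sketch, not the actual update_config_files logic: merge the
# global table with the model_configs files and keep exactly one (the fastest)
# row per shape key, so the build's no-collision assertion can pass.
import pandas as pd

def resolve_collisions(paths, key_cols):
    frames = []
    for p in paths:
        df = pd.read_csv(p)
        df["_src"] = p  # remember which file each row came from
        frames.append(df)
    merged = pd.concat(frames, ignore_index=True)
    # Lowest measured latency ('us') wins per shape key.
    winners = merged.sort_values("us").drop_duplicates(subset=key_cols, keep="first")
    # Split the winners back out per source file for writing.
    return {p: winners[winners["_src"] == p].drop(columns="_src") for p in paths}
```

Under this rule every colliding shape stays covered by exactly one row, taken from whichever file measured the lower 'us', which matches the per-file row counts listed above.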
@akii96 akii96 marked this pull request as draft May 5, 2026 11:41
sunway513 pushed a commit that referenced this pull request May 5, 2026
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)

* instances

* Add MiniMax M25 FMoE cleaned entries

* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale
GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions
  (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004.
Net semantic effect of this PR vs current main:
  GEMM: +6375 / -5
  FMoE: +57 / -7
The remaining 1042 of #2981's 1099 textual FMoE adds are content-
identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

#2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv
that share (M,N,K,cu_num,gfx) shape keys with the model-specific
a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape
collisions across merged config files; resolve by keeping the lowest
'us' row per shape:

- 187 ds_v3 rows dropped (superseded by #2981's better tunings)
- 36 #2981 rows dropped (ds_v3's existing tunings were faster)

Result: every conflicting shape still has a tuning, picked from
whichever file had the better measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE.
The build's update_config_files asserts no shape collisions across
merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag);
the additions in this PR introduced 65 cross-file collisions between
tuned_fmoe.csv and 4 model_configs files.

Resolution (best 'us' per shape):
- tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better
  existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
- a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
- a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
- a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
- glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

Every shape contributed by #2981 and #2982 remains covered post-dedup —
where their row was not the winner, the existing model-specific tuning
wins on its own merits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config
Keep the surviving MiniMax M25 FMoE rows in the model-specific config
file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs
Keep the surviving DeepSeek V3.2 tuning rows in model-specific config
files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

Address likely MiniMax tuning contamination by moving the per-token
GLM-4.7 FMoE entries back to the GLM model config where they belong.

* [moe] Disable blockscale GEMM2 instance that exceeds LDS
Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2
candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@akii96
Contributor Author

akii96 commented May 5, 2026

Merged with #3024

@akii96 akii96 closed this May 5, 2026
@akii96 akii96 deleted the add-minimax-m25-fmoe-tunings branch May 5, 2026 19:17
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026