
Add MiniMax M25 A8W8 blockscale GEMM tunings #2979

Closed
akii96 wants to merge 1 commit into main from gemm-tuning-minimax-m25-gfx950

Conversation

@akii96 (Contributor) commented Apr 30, 2026

Adds MiniMax M25 A8W8 blockscale GEMM tuning entries and keeps the tuning table deduplicated and sorted.
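The dedupe-and-sort step could be sketched roughly as below. This is a hypothetical illustration, not the actual aiter tooling: the column names (cu_num, M, N, K) and the keep-the-last-duplicate policy are assumptions about the tuning-table schema.

```python
# Hypothetical sketch of deduplicating and sorting a GEMM tuning CSV.
# Column names and dedup policy are assumptions, not the real schema.
import csv
import io

def dedupe_and_sort(csv_text: str) -> str:
    """Drop duplicate shape rows (keeping the last, i.e. freshest, entry)
    and emit the table sorted by its shape key columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = reader.fieldnames
    by_key = {}
    for row in reader:
        key = (int(row["cu_num"]), int(row["M"]), int(row["N"]), int(row["K"]))
        by_key[key] = row  # later rows overwrite earlier duplicates
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for key in sorted(by_key):
        writer.writerow(by_key[key])
    return out.getvalue()
```

Keeping the table canonically sorted makes later bulk merges (like the Silo merge referenced further down) produce clean, reviewable diffs.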

Prerequisites for this to be merged:

@akii96 akii96 requested a review from a team April 30, 2026 13:32
@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label           Tests
ci:triton-300x  Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang       SGLang integration tests
ci:atom         ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm         vLLM benchmark
ci:all          All of the above

Add labels via the sidebar or gh pr edit 2979 --add-label <label>

@akii96 akii96 marked this pull request as draft April 30, 2026 13:40
@akii96 (Contributor, Author) commented Apr 30, 2026

I checked, and upstream aiter main has since changed: it now takes an additional gfx_arch column as input.

This should be a small fix, and I can address it once I rebase and test it myself on workloads.
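A backfill for such a new column could look like the sketch below. The helper name and the assumption that gfx_arch is appended as the last column are mine, not upstream's.

```python
# Hypothetical sketch: backfill an existing tuning table with a new
# gfx_arch column. Column position and default value are assumptions.
import csv
import io

def add_gfx_arch_column(csv_text: str, arch: str = "gfx950") -> str:
    """Append a gfx_arch column, tagging every existing row with `arch`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fields = list(reader.fieldnames) + ["gfx_arch"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["gfx_arch"] = arch
        writer.writerow(row)
    return out.getvalue()
```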

@akii96 (Contributor, Author) commented May 4, 2026

Latest updates

  • Rebased onto current origin/main

  • Re-tuned the full M25 shape set on MI355X (gfx950, cu_num=256, 8 GPUs)

Dependency on PR #2541

@akii96 akii96 marked this pull request as ready for review May 4, 2026 06:58
@akii96 akii96 requested a review from amd-yashagar May 4, 2026 06:59
@amd-yashagar (Contributor)

Looks good to me. Thank you @akii96.

@akii96 akii96 force-pushed the gemm-tuning-minimax-m25-gfx950 branch from aa8d197 to 4f555c5 on May 4, 2026 11:33
@sunway513 (Collaborator)

This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded.

Tracking issue: ROCm/AI-Frameworks-Dashboard#141

sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
@akii96 akii96 marked this pull request as draft May 5, 2026 11:41
@akii96 (Contributor, Author) commented May 5, 2026

Merged with #3024

@akii96 akii96 closed this May 5, 2026
@akii96 akii96 deleted the gemm-tuning-minimax-m25-gfx950 branch May 5, 2026 19:18

3 participants