
refresh Kimi K2 FP4 fused MoE tunings (TP2 / 256 CU) #2938

Closed

xaguilar-amd wants to merge 1 commit into ROCm:main from xaguilar-amd:kimik2_fp4_tp2_tunings

Conversation

@xaguilar-amd (Contributor) commented Apr 28, 2026

Summary

Updates aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv with a new round of tuned fused MoE kernel selections for Kimi K2–style FP4 MoE, tuned for MI355X.

What changed

  • Re-selected stage-1 / stage-2 kernels (FlyDSL + CK mix) across token counts and expert geometries (inter_dim 256 / 512, expert counts 384/8 and 385/9, plus inter_dim = 1024 / 385/9 where new rows were added).
  • Replaced many earlier flydsl_fallback rows (which used pure CK two-stage GEMMs when FlyDSL was unavailable) with concrete FlyDSL MoE kernels where tuning shows a win, populating timing / TFLOPS / bandwidth metadata where available (a minimal lookup sketch follows this list).
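
For context on how a tuned table like this is typically consumed, here is a minimal lookup sketch. The column names (token_cnt, inter_dim, expert_cnt, topk, stage1_kernel, stage2_kernel) and the dispatch logic are assumptions for illustration only; the actual schema of kimik2_fp4_tuned_fmoe.csv and aiter's kernel-selection code may differ.

```python
# Illustrative sketch only: the column names and schema below are assumed for
# this example and may not match the real kimik2_fp4_tuned_fmoe.csv layout.
import csv


def load_tuned_fmoe(path):
    """Index tuned stage-1/stage-2 kernel picks by (tokens, inter_dim, experts, topk)."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (
                int(row["token_cnt"]),
                int(row["inter_dim"]),
                int(row["expert_cnt"]),
                int(row["topk"]),
            )
            table[key] = (row["stage1_kernel"], row["stage2_kernel"])
    return table


# Hypothetical lookup for one of the shapes touched by this PR:
# 256 tokens, inter_dim=512, 385 experts, top-9 routing.
selections = load_tuned_fmoe("aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv")
print(selections.get((256, 512, 385, 9), "flydsl_fallback"))
```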

Motivation

The existing table mixed strong FlyDSL choices with a large fallback-only region. This refresh aligns the shipped config with measured best kernels for the TP2-style (256 CU) layout and extends coverage for additional intermediate / routed shapes used by the model.

@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests
ci:atom          ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm          vLLM benchmark
ci:all           All of the above

Add labels via the sidebar or gh pr edit 2938 --add-label <label>

@xaguilar-amd xaguilar-amd marked this pull request as ready for review April 28, 2026 12:29
@xaguilar-amd xaguilar-amd requested a review from a team April 28, 2026 12:29
@xaguilar-amd (Contributor, Author) commented Apr 30, 2026

The CI is failing due to Docker Hub rate limits, not code issues:
toomanyrequests: You have reached your unauthenticated pull rate limit

@sunway513 Could you please help resolve this? Thanks!

@sunway513 (Collaborator)

This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded.

Tracking issue: ROCm/AI-Frameworks-Dashboard#141

sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)