
refresh Kimi K2 FP4 fused MoE tunings (TP2 / 256 CU) #2938

Closed

xaguilar-amd wants to merge 1 commit into ROCm:main from xaguilar-amd:kimik2_fp4_tp2_tunings

Conversation

@xaguilar-amd (Contributor) commented Apr 28, 2026

Summary

Updates aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv with a new round of tuned fused MoE kernel selections for Kimi K2–style FP4 MoE, tuned for MI355X.

What changed

  • Re-selected stage-1 / stage-2 kernels (FlyDSL + CK mix) across token counts and expert geometries (inter_dim 256 / 512, expert counts 384/8 and 385/9, plus inter_dim = 1024 / 385/9 where new rows were added).
  • Replaced many earlier flydsl_fallback rows (which used pure CK two-stage GEMMs when FlyDSL was unavailable) with concrete FlyDSL MoE kernels where tuning shows a win, populating timing / TFLOPS / bandwidth metadata where available (a minimal lookup sketch follows this list).
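
For context on how a tuned table like this is typically consumed, here is a minimal lookup sketch. The column names (token_cnt, inter_dim, expert_cnt, topk, stage1_kernel, stage2_kernel) and the dispatch logic are assumptions for illustration only; the actual schema of kimik2_fp4_tuned_fmoe.csv and aiter's kernel-selection code may differ.

```python
# Illustrative sketch only: the column names and schema below are assumed for
# this example and may not match the real kimik2_fp4_tuned_fmoe.csv layout.
import csv


def load_tuned_fmoe(path):
    """Index tuned stage-1/stage-2 kernel picks by (tokens, inter_dim, experts, topk)."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (
                int(row["token_cnt"]),
                int(row["inter_dim"]),
                int(row["expert_cnt"]),
                int(row["topk"]),
            )
            table[key] = (row["stage1_kernel"], row["stage2_kernel"])
    return table


# Hypothetical lookup for one of the shapes touched by this PR:
# 256 tokens, inter_dim=512, 385 experts, top-9 routing.
selections = load_tuned_fmoe("aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv")
print(selections.get((256, 512, 385, 9), "flydsl_fallback"))
```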

Motivation

The existing table mixed strong FlyDSL choices with a large fallback-only region. This refresh aligns the shipped config with measured best kernels for the TP2-style (256 CU) layout and extends coverage for additional intermediate / routed shapes used by the model.

@github-actions (Contributor)

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label            Tests
ci:triton-300x   Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
ci:sglang        SGLang integration tests
ci:atom          ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm          vLLM benchmark
ci:all           All of the above

Add labels via the sidebar or gh pr edit 2938 --add-label <label>

@xaguilar-amd xaguilar-amd marked this pull request as ready for review April 28, 2026 12:29
@xaguilar-amd xaguilar-amd requested a review from a team April 28, 2026 12:29
@xaguilar-amd (Contributor, Author) commented Apr 30, 2026

The CI is failing due to Docker Hub rate limits, not code issues:
toomanyrequests: You have reached your unauthenticated pull rate limit

@sunway513 Could you please help resolve this? Thanks!

@sunway513 (Collaborator)

This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded.

Tracking issue: ROCm/AI-Frameworks-Dashboard#141

sunway513 added a commit that referenced this pull request May 4, 2026
Squash-merged from main commit 52c4554.

Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv:
two blocks resolved by taking theirs (Silo). Block 1 upgrades existing
M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered
variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely
additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes
plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings
on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference,
  confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)