Add GLM-4.7 FP8 tuned and untuned FMOE configs #2923
Conversation
Pull request overview
This PR adds GLM-4.7 FP8 fMOE shape configuration CSVs so AITER can pick up tuned kernels for the GLM-4.7 FP8 MoE path (notably for TP4 + expert parallel on MI355x).
Changes:
- Added a new tuned fMOE config CSV for GLM-4.7 FP8 (glm47_fp8_tuned_fmoe.csv).
- Added a new untuned/reference fMOE config CSV for GLM-4.7 FP8 (glm47_fp8_untuned_fmoe.csv).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv | Adds tuned fMOE kernel selections and performance metadata for GLM-4.7 FP8 shapes. |
| aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv | Adds the corresponding untuned/reference shape list for GLM-4.7 FP8. |
32,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
64,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
128,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
256,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
512,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
1024,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
2048,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
4096,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
8192,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
1,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
2,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
4,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
8,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
16,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
16384,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
32768,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0
The token sizes in this untuned reference file are not in a consistent order (starts at 32..8192, then 1..16, then 16384..32768). Since other untuned_fmoe.csv files in model_configs are generally grouped/sorted for readability and easier diffing, consider sorting/grouping these rows (e.g., ascending token) to make the file easier to maintain.
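A minimal sketch of the suggested cleanup, assuming the untuned CSV's first column is the token count and that the file may or may not carry a header row (the path handling and header detection here are illustrative, not part of this PR):

```python
import csv
from pathlib import Path

# Illustrative: adjust to the repo checkout location.
path = Path("aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv")

rows = list(csv.reader(path.open()))

# Keep a header row in place if one exists (first field not an integer).
header = []
if rows and not rows[0][0].isdigit():
    header, rows = [rows[0]], rows[1:]

# Sort ascending by token count (first column), as the review suggests.
rows.sort(key=lambda r: int(r[0]))

with path.open("w", newline="") as f:
    csv.writer(f).writerows(header + rows)
```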
256,32,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,164.2209,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,164.2209,1,73.56,5749.63
256,64,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,168.1699,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,168.1699,1,143.66,5617.54
256,128,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,177.1016,_ZN5aiter48fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_ps_32x384E,0.00%,0,,0.00%,177.1016,1,272.83,5339.79
256,256,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,111.1445,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_64x128_2tg_pf3E,0.00%,94.3821,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,205.5266,0,470.19,4610.84
256,512,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,128.617,_ZN5aiter45fmoe_stage1_bf16_pertokenFp8_g1u1_128x128_pf3E,0.00%,136.7373,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,265.3543,0,728.36,3586.08
256,1024,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,227.447,moe_ck2stages_gemm1_256x64x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,192.093,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,419.54,0,921.36,2286.9
256,2048,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,352.8768,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,345.4655,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.70%,698.3423,0,1107.04,1396.42
256,4096,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,582.0723,moe_ck2stages_gemm1_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,641.9346,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,1224.0069,0,1263.22,822.41
256,8192,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,1078.8184,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,1239.192,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,2318.0104,0,1334.06,461.41
256,1,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,24.9944,moe_ck2stages_gemm1_256x32x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,20.3787,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,45.3731,0,8.32,20799.41
256,2,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,40.11,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_32x128_3tg_pf3E,0.00%,28.5666,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,68.6766,0,10.99,13741.93
256,4,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,99.8134,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192E,0.00%,0,,0.00%,99.8134,1,15.13,9455.44
256,8,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,134.9101,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,134.9101,1,22.38,6996.08
256,16,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,156.6731,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,156.6731,1,38.55,6025.06
256,16384,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,1988.7154,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,2328.3393,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,4317.0547,0,1432.63,276.9
256,32768,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,3930.1981,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,4488.9205,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,8419.1186,0,1469.22,171.87
The rows in this tuned config aren’t ordered consistently by token (32..8192, then 1..16, then 16384..32768). Ordering the rows (e.g., ascending token) would improve maintainability and make it easier to compare against other tuned_fmoe CSVs and future updates.
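As a related sanity check, here is a hedged sketch that compares the shape coverage of the tuned and untuned GLM-4.7 FP8 CSVs, assuming the untuned row fields reappear in the tuned rows starting at the second column (this column layout is inferred from the rows shown above, not from a documented schema):

```python
import csv
from pathlib import Path

cfg_dir = Path("aiter/configs/model_configs")  # illustrative path


def load_shapes(name: str, offset: int) -> set:
    """Collect (token, dim, inter_dim, expert, topk) tuples; the column meaning is an assumption."""
    with (cfg_dir / name).open() as f:
        rows = [r for r in csv.reader(f) if r and r[0].isdigit()]
    return {tuple(r[offset:offset + 5]) for r in rows}


untuned = load_shapes("glm47_fp8_untuned_fmoe.csv", offset=0)
tuned = load_shapes("glm47_fp8_tuned_fmoe.csv", offset=1)  # first tuned column appears to be the CU count

print("shapes missing from tuned:", sorted(untuned - tuned))
print("tuned-only shapes:", sorted(tuned - untuned))
```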
This PR's content was bulk-merged via #3004 ([Silo] Bulk merge: tuned GEMM and FMoE configs, merged 2026-05-02 03:16 UTC). Please close this PR as superseded. Tracking issue: ROCm/AI-Frameworks-Dashboard#141
Squash-merged from main commit 52c4554. Includes 5 atomic Silo PRs:
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: two blocks resolved by taking theirs (Silo). Block 1 upgrades existing M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely additive: 30+ new rows for previously-uncovered N=7168/K=1024 shapes plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08. Silo customers need these tunings on the AITER release wheel, not nightly.

Verification gate before tag:
- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference, confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
Motivation
GLM-4.7-FP8 is missing tuned fMOE configs when running in TP4 + expert-parallel mode on MI355x.
Technical Details
- aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv (tuned FMOE shapes for GLM-4.7 in FP8)
- aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv (untuned reference)

Test Plan
…/model_configs/glm47_fp8_tuned_fmoe.csv listed in the "[aiter] merge tuned file under model_configs/ and configs/ …" line.
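A hedged way to check this from a captured run log (the log path and exact message wording are assumptions; only the CSV file name and the "[aiter] merge tuned file" phrasing come from this PR):

```python
from pathlib import Path

# Illustrative: point this at a captured server/benchmark log.
log_path = Path("/tmp/aiter_glm47_run.log")
log_text = log_path.read_text()

wanted_file = "glm47_fp8_tuned_fmoe.csv"
merge_lines = [
    line for line in log_text.splitlines()
    if "[aiter] merge tuned file" in line
]

# The tuned CSV should appear in (or near) the merge line if AITER picked it up.
picked_up = any(wanted_file in line for line in merge_lines) or wanted_file in log_text
print("merge lines:", merge_lines)
print("tuned config referenced:", picked_up)
```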