[Silo] Bulk merge: tuned GEMM and FMoE configs (GLM-4.7, Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.2) #3004
Conversation
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor
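For readers unfamiliar with how these tables are consumed: kernel selection amounts to a keyed lookup into the tuned CSV. Below is a minimal sketch, assuming column names (cu_num, M, N, K, kernelName, splitK, us) inferred from this PR's description; the real lookup lives in aiter's dispatch code and its schema may differ.

```python
# Minimal sketch of how a tuned table like a8w8_blockscale_tuned_gemm.csv
# can drive kernel selection. Column names ("cu_num", "M", "N", "K",
# "kernelName", "splitK", "us") are assumptions based on this PR's
# description, not aiter's exact schema.
import csv

def load_tuned_gemm(path):
    """Index tuned rows by (cu_num, M, N, K), keeping the fastest entry."""
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (int(row["cu_num"]), int(row["M"]),
                   int(row["N"]), int(row["K"]))
            # If a shape appears twice, keep the lower-latency row.
            if key not in table or float(row["us"]) < float(table[key]["us"]):
                table[key] = row
    return table

def pick_kernel(table, cu_num, m, n, k):
    """Return (kernel name, splitK) for an exact shape match, if tuned."""
    row = table.get((cu_num, m, n, k))
    return (row["kernelName"], int(row["splitK"])) if row else None
```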
Pull request overview
Bulk merge of multiple Silo “config-only” PRs, adding/updating tuned GEMM and fused-MoE (FMoE) CSV tables under aiter/configs/ (and aiter/configs/model_configs/) to improve kernel selection for specific models/hardware targets (e.g., MI355X / cu_num=256).
Changes:
- Update model-specific tuned FMoE table for Kimi K2.5 FP4 (kimik2_fp4_tuned_fmoe.csv).
- Add GLM-4.7 FP8 tuned + untuned FMoE shape tables (glm47_fp8_{tuned,untuned}_fmoe.csv).
- (Per PR description) bulk-update shared tuned tables for GEMM/FMoE under aiter/configs/.
Reviewed changes
Copilot reviewed 2 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| aiter/configs/a8w8_blockscale_tuned_gemm.csv | (Per PR description) bulk tuning table updates for A8W8 blockscale GEMMs. |
| aiter/configs/tuned_fmoe.csv | (Per PR description) bulk tuning table updates for fused MoE kernel selection. |
| aiter/configs/model_configs/glm47_fp8_tuned_fmoe.csv | New GLM-4.7 FP8 tuned FMoE configs (cu_num=256). |
| aiter/configs/model_configs/glm47_fp8_untuned_fmoe.csv | New GLM-4.7 FP8 untuned reference shapes (input list for tuning). |
| aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv | Updated Kimi K2.5 FP4 tuned FMoE configs (cu_num=256). |
```csv
256,1,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,24.9944,moe_ck2stages_gemm1_256x32x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,20.3787,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,45.3731,0,8.32,20799.41
256,2,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,40.11,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_32x128_3tg_pf3E,0.00%,28.5666,moe_ck2stages_gemm2_256x32x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,68.6766,0,10.99,13741.93
256,4,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,99.8134,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x192E,0.00%,0,,0.00%,99.8134,1,15.13,9455.44
256,8,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,134.9101,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,134.9101,1,22.38,6996.08
256,16,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,156.6731,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,156.6731,1,38.55,6025.06
256,32,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,164.2209,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,164.2209,1,73.56,5749.63
256,64,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,168.1699,_ZN5aiter45fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_32x256E,0.00%,0,,0.00%,168.1699,1,143.66,5617.54
256,128,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,32,0,177.1016,_ZN5aiter48fmoe_bf16_pertokenFp8_g1u1_vs_silu_1tg_ps_32x384E,0.00%,0,,0.00%,177.1016,1,272.83,5339.79
256,256,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,111.1445,_ZN5aiter48fmoe_stage1_bf16_pertokenFp8_g1u1_64x128_2tg_pf3E,0.00%,94.3821,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,205.5266,0,470.19,4610.84
256,512,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,128.617,_ZN5aiter45fmoe_stage1_bf16_pertokenFp8_g1u1_128x128_pf3E,0.00%,136.7373,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,265.3543,0,728.36,3586.08
256,1024,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,64,0,227.447,moe_ck2stages_gemm1_256x64x64x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,192.093,moe_ck2stages_gemm2_256x64x128x256_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,419.54,0,921.36,2286.9
256,2048,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,352.8768,moe_ck2stages_gemm1_256x128x64x128_1x4_TypeCast_v1_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,345.4655,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.70%,698.3423,0,1107.04,1396.42
256,4096,5120,1536,40,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fn,torch.float8_e4m3fn,QuantType.per_Token,1,0,128,0,582.0723,moe_ck2stages_gemm1_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight0_silu_F8_F8_B16,0.00%,641.9346,moe_ck2stages_gemm2_256x128x128x128_1x4_TypeCast_v3_Nswizzle0_Quant2_MulRoutedWeight1_F8_F8_B16,3.80%,1224.0069,0,1263.22,822.41
```
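For orientation, the rows above can be read with a small parser. This is a sketch only: the column interpretation (cu_num, token count, model/inter dims, expert count, top-k, then stage-1/stage-2 latency-kernel-error triples and a combined time in microseconds) is inferred from the sample values, not from a documented aiter schema. An empty stage-2 kernel (tokens 4..128 above) appears to mean a single fused kernel covers both stages.

```python
# Hedged reader for rows in the layout shown above. Field positions are
# assumptions inferred from the sample values, not aiter's documented schema.
from dataclasses import dataclass

@dataclass
class FmoeRow:
    cu_num: int
    tokens: int
    stage1_kernel: str
    stage2_kernel: str   # empty string when one fused kernel covers both stages
    total_us: float

def parse_fmoe_line(line: str) -> FmoeRow:
    f = line.strip().split(",")
    return FmoeRow(
        cu_num=int(f[0]),
        tokens=int(f[1]),
        stage1_kernel=f[16],       # assumed column positions
        stage2_kernel=f[19],
        total_us=float(f[21]),
    )
```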
[Silo] Bulk merge: tuned GEMM and FMoE configs (GLM-4.7, Kimi-K2.5, MiniMax-M2.5, DeepSeek-V3.2) (#3004)

* Add GLM-4.7 FP8 tuned and untuned FMOE configs
* Added MI355X MoE tunings for Kimi-K2 FP4 TP2
* Add MiniMax M25 A8W8 blockscale GEMM tunings
* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

  Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

  GEMM (a8w8_blockscale_tuned_gemm.csv):
  - 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
  - Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
  - Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

  FMoE (tuned_fmoe.csv):
  - 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
  - Replaces 751 previous cu_num=256 entries with re-tuned results
  - Existing cu_num=80 (MI300X) entries unchanged

  Made-with: Cursor

* Add MiniMax M25 FMoE tunings
* fix: dedup 1692 duplicate entries in tuned_fmoe.csv from merge
* fix: remove 446 shapes from main CSV that duplicate ds_v3 model config
* fix: remove FMoE shapes that duplicate model-specific configs

---------

Co-authored-by: Olga Miroshnichenko <olga.miroshnichenko@amd.com>
Co-authored-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Squash-merged from main commit 52c4554. Includes 5 atomic Silo PRs:

- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs (new)
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings

Conflict in aiter/configs/model_configs/kimik2_fp4_tuned_fmoe.csv: two blocks resolved by taking theirs (Silo). Block 1 upgrades existing M=256/N=512 rows from base kernel suffixes (w3) to tuner-discovered variants (w3_xcd4, _bnt2_persist, _sbm32, _sbm64). Block 2 is purely additive: 30+ new rows for previously uncovered N=7168/K=1024 shapes plus a flydsl_fallback section.

Driver: vLLM 0.21 freeze 2026-05-08 — Silo customers need these tunings on the AITER release wheel, not nightly.

Verification gate before tag:

- Kernel suffix parser smoke (Kimi-K2.5-MXFP4 1-token inference, confirm new suffixes JIT-compile without falling back)
- ATOM 5-model accuracy unchanged within +/- 0.005 vs v0.1.13-rc1
- Perf delta on Kimi-K2.5 / MiniMax-M2.5 / DSv3.2 (expect flat or better)

(cherry picked from commit 52c4554)
Follow-up note for #3004: please see my comment on #2982 here: #2982 (comment). The #2982 branch includes CK two-stage instance additions that were not included in this config-only bulk merge. I was not sure if a separate PR was needed just for a few additions, so I included them in my tunings PR. Apologies, I understand now it was intended to be a pure config-only PR; let me know if you need a separate PR for the instances.
Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

GEMM (a8w8_blockscale_tuned_gemm.csv):
- 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
- Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
- Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

FMoE (tuned_fmoe.csv):
- 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
- Replaces 751 previous cu_num=256 entries with re-tuned results
- Existing cu_num=80 (MI300X) entries unchanged

Made-with: Cursor

(Cherry-picked from #2981 to restore content lost in bulk merge #3004. Net semantic effect of this PR vs current main:
- GEMM: +6375 / -5
- FMoE: +57 / -7

The remaining 1042 of #2981's 1099 textual FMoE adds are content-identical reorderings already present on main.)
* Add MiniMax M25 A8W8 blockscale GEMM tunings on gfx950 (splitK + AQRowMajor)
* instances
* Add MiniMax M25 FMoE cleaned entries
* [configs] Add MI355X tuned GEMM and FMoE configs for DeepSeek-V3.2

  Add gfx950 (MI355X, cu_num=256) tuning results for A8W8 block-scale GEMM and fused MoE kernels, optimized for DeepSeek-V3.2 shapes.

  GEMM (a8w8_blockscale_tuned_gemm.csv):
  - 6375 entries covering M=1..8192 for all DSv32 (N,K) shapes
  - Includes split-K tuned configs per shape (best of splitK=0 vs splitK>0)
  - Key decode (M=1) improvements: 128x7168 -59%, 7168x4096 -33%

  FMoE (tuned_fmoe.csv):
  - 802 cu_num=256 entries for DSv32 expert dimensions (N=512/4096/4608/7168, K=1536/7168/9216)
  - Replaces 751 previous cu_num=256 entries with re-tuned results
  - Existing cu_num=80 (MI300X) entries unchanged

  Made-with: Cursor

  (Cherry-picked from #2981 to restore content lost in bulk merge #3004. Net semantic effect of this PR vs current main: GEMM: +6375 / -5; FMoE: +57 / -7. The remaining 1042 of #2981's 1099 textual FMoE adds are content-identical reorderings already present on main.)

* [configs] dedup colliding (M,N,K,cu_num,gfx) shapes between #2981 and ds_v3

  #2981 added 223 DSv3.2 GEMM tunings to a8w8_blockscale_tuned_gemm.csv that share (M,N,K,cu_num,gfx) shape keys with the model-specific a8w8_blockscale_tuned_gemm_ds_v3.csv. The aiter build asserts no shape collisions across merged config files; resolve by keeping the lowest 'us' row per shape:
  - 187 ds_v3 rows dropped (superseded by #2981's better tunings)
  - 36 #2981 rows dropped (ds_v3's existing tunings were faster)

  Result: every conflicting shape still has a tuning, picked from whichever file had the better measurement.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] dedup colliding FMoE shapes across global + model_configs files

  Same lowest-'us'-wins resolution as the GEMM dedup, applied to FMoE. The build's update_config_files asserts no shape collisions across merged tuned_fmoe files (key = untuned_fmoe.csv columns + cu_num + _tag); the additions in this PR introduced 65 cross-file collisions between tuned_fmoe.csv and 4 model_configs files.

  Resolution (best 'us' per shape):
  - tuned_fmoe.csv: 1080 -> 1039 rows (lost 41 to model files with better existing tunings — mostly 26 minimax + 10 glm47 + 4 ds_v3 + 1 qwen3_235b)
  - a8w8_blockscale_tuned_fmoe_ds_v3.csv: 16 -> 4 (12 superseded by #2981)
  - a8w8_blockscale_tuned_fmoe_minimax-m2_5.csv: 32 -> 26 (6 superseded by #2982)
  - a8w8_blockscale_tuned_fmoe_qwen3_235b.csv: 32 -> 28 (4 superseded by #2981)
  - glm47_fp8_tuned_fmoe.csv: 16 -> 14 (2 superseded by #2981)

  Every shape contributed by #2981 and #2982 remains covered post-dedup — where their row was not the winner, the existing model-specific tuning wins on its own merits.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [configs] Move MiniMax FMoE tunings to model config

  Keep the surviving MiniMax M25 FMoE rows in the model-specific config file instead of the global tuned_fmoe table.

* [configs] Move DeepSeek tunings to model configs

  Keep the surviving DeepSeek V3.2 tuning rows in model-specific config files instead of the global tuning tables.

* [configs] Move GLM FMoE rows back to GLM config

  Address likely MiniMax tuning contamination by moving the per-token GLM-4.7 FMoE entries back to the GLM model config where they belong.
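The lowest-'us'-wins dedup described in the commits above boils down to a short script. A sketch under stated assumptions: each config file is a CSV with a latency column named "us", and the shape-key columns are supplied by the caller (per the commit messages: (M, N, K, cu_num, gfx) for GEMM, the untuned_fmoe.csv columns + cu_num + _tag for FMoE).

```python
# Sketch of the lowest-'us'-wins cross-file dedup described above.
# Assumes CSV files with a latency column named "us"; shape-key
# columns are passed in by the caller.
import csv

def dedup_lowest_us(paths, key_cols):
    """Keep each shape only in the file where its fastest row lives."""
    rows = {}
    for path in paths:
        with open(path, newline="") as f:
            rows[path] = list(csv.DictReader(f))

    best = {}  # shape key -> (latency in us, owning file)
    for path in paths:
        for row in rows[path]:
            key = tuple(row[c] for c in key_cols)
            us = float(row["us"])
            if key not in best or us < best[key][0]:
                best[key] = (us, path)

    # A row survives only in the file that won its shape key, so every
    # shape stays covered while collisions across files disappear.
    return {
        path: [r for r in rows[path]
               if best[tuple(r[c] for c in key_cols)][1] == path]
        for path in paths
    }
```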
* [moe] Disable blockscale GEMM2 instance that exceeds LDS

  Avoid prebuilding the 256x128x128x128 2x2 blockscale GEMM2 candidate, which exceeds the local memory limit during JIT compilation.

* [moe] Disable risky/unused blockscale MoE instances

---------

Co-authored-by: Aakif Nawaz <aaknawaz@amd.com>
Co-authored-by: Aakif Nawaz <aakif.nawaz@amd.com>
Co-authored-by: frida-andersson <fanderss@amd.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
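For intuition on the LDS overflow: a rough back-of-envelope, assuming CK-style BlockSize x MPerBlock x NPerBlock x KPerBlock naming for the 256x128x128x128 instance and double-buffered fp8 A/B tiles staged in LDS. Real instances also stage block scales and padding, so this is a lower bound rather than the compiler's exact accounting.

```python
# Rough lower-bound estimate of LDS use for a 128x128x128 fp8 tile,
# under the assumptions stated above (not the compiler's exact figure).
M_PER_BLOCK, N_PER_BLOCK, K_PER_BLOCK = 128, 128, 128
BYTES_PER_ELEM = 1   # fp8
BUFFERS = 2          # double-buffered prefetch

a_tile = M_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM   # 16 KiB
b_tile = N_PER_BLOCK * K_PER_BLOCK * BYTES_PER_ELEM   # 16 KiB
total = (a_tile + b_tile) * BUFFERS                   # 64 KiB before scales
print(f"A/B tiles alone: {total // 1024} KiB of LDS, before block scales")
```

On this rough accounting the A/B tiles alone consume the entire 64 KiB LDS budget common on AMD GPUs before block-scale staging is added, which is consistent with the instance failing to JIT-compile.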
Summary
Bulk merge of 5 Silo config-only PRs for v0.1.13 deadline (vLLM 0.21 freeze 2026-05-08).
Pure CSV tuning configs — no code changes.
Included PRs
- #2923 GLM-4.7 FP8 tuned/untuned FMoE configs
- #2938 Kimi-K2.5 FP4 fused MoE tunings (TP2 / 256 CU refresh)
- #2979 MiniMax-M2.5 A8W8 blockscale GEMM tunings
- #2981 DeepSeek-V3.2 MI355X tuned GEMM and FMoE configs
- #2982 MiniMax-M2.5 FMoE tunings
Files Changed
All under aiter/configs/:
- a8w8_blockscale_tuned_gemm.csv
- tuned_fmoe.csv
- model_configs/glm47_fp8_tuned_fmoe.csv (new)
- model_configs/glm47_fp8_untuned_fmoe.csv (new)
- model_configs/kimik2_fp4_tuned_fmoe.csv

Risk
Low — CSV config files only. No kernel code, no API changes, no build changes.
Test Plan