Optimize SM120 NVFP4 GEMM kernel with small-M tile config #4
Draft
Reference: sglang PR vllm-project#21314

- New tile config sm120_fp4_config_small_m with MmaTileShape 128x128x256 for small M values (M ≤ 32), doubling the K tile for better throughput
- Updated dispatch: M ≤ 32 → small_m, M ≤ 256 → M256, M > 256 → default
- ~20% speedup for decode-phase small-batch GEMM operations

Co-authored-by: GitHub Copilot
Agent-Logs-Url: https://github.com/Nekofish-L/vllm/sessions/66285e45-f69c-404b-975a-4afc5d3edb4e
Co-authored-by: Nekofish-L <29830327+Nekofish-L@users.noreply.github.com>
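The dispatch thresholds listed in this commit message can be sketched as a small selection function. This is only an illustration under stated assumptions: the enum `Sm120Fp4Config` and the function `select_sm120_fp4_config` are hypothetical names, and the actual kernel presumably selects between CUTLASS config types rather than enum values.

```cpp
#include <cstdint>

// Hypothetical tags standing in for the tile configs named in the commit
// message (small_m, M256, default). Not the PR's real types.
enum class Sm120Fp4Config { SmallM, M256, Default };

// Pick a tile config from the M dimension, using the thresholds stated in
// the commit message: M <= 32, M <= 256, otherwise default.
constexpr Sm120Fp4Config select_sm120_fp4_config(int64_t m) {
  if (m <= 32) return Sm120Fp4Config::SmallM;
  if (m <= 256) return Sm120Fp4Config::M256;
  return Sm120Fp4Config::Default;
}

// Compile-time checks of the boundary values.
static_assert(select_sm120_fp4_config(32) == Sm120Fp4Config::SmallM, "M=32");
static_assert(select_sm120_fp4_config(33) == Sm120Fp4Config::M256, "M=33");
static_assert(select_sm120_fp4_config(257) == Sm120Fp4Config::Default, "M=257");
```

Keeping the thresholds in one function makes the boundaries (32 and 256) easy to check at compile time and to retune later.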
Copilot created this pull request from a session on behalf of Nekofish-L, April 15, 2026 08:50
Copilot stopped work on behalf of Nekofish-L due to an error, April 15, 2026 09:05
When M is small (≤64), swap A/B operands so the small M dimension becomes the N dimension in the CUTLASS GEMM. This improves GPU utilization during decode by providing better CTA scheduling and memory access patterns. Follows the same pattern used in FP8 SM90, SM100, and SM120 blockwise kernels.

Co-authored-by: GitHub Copilot
Agent-Logs-Url: https://github.com/Nekofish-L/vllm/sessions/86332631-5db7-485e-8d7f-3f51fce66977
Co-authored-by: Nekofish-L <29830327+Nekofish-L@users.noreply.github.com>
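The operand swap relies on the identity D = A·Bᵀ ⇔ Dᵀ = B·Aᵀ: running the GEMM with B as the left operand produces the transposed output, so the small M dimension ends up in the N position. A minimal CPU sketch of that equivalence follows, using plain floats rather than FP4. The helper names `gemm_nt`, `use_swap_ab`, and `gemm_maybe_swapped` are hypothetical, and the explicit transpose at the end is only for illustration; a real kernel would more likely write the output through a transposed layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference GEMM: out[i][j] = sum_p lhs[i][p] * rhs[j][p].
// lhs is rows x k, rhs is cols x k, out is rows x cols, all row-major.
std::vector<float> gemm_nt(const std::vector<float>& lhs,
                           const std::vector<float>& rhs,
                           int rows, int cols, int k) {
  std::vector<float> out(static_cast<std::size_t>(rows) * cols, 0.0f);
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < cols; ++j)
      for (int p = 0; p < k; ++p)
        out[i * cols + j] += lhs[i * k + p] * rhs[j * k + p];
  return out;
}

// Hypothetical helper; the threshold (M <= 64) is from the commit message.
constexpr bool use_swap_ab(int64_t m) { return m <= 64; }

// Computes D (m x n, row-major), going through the swapped problem for small M.
std::vector<float> gemm_maybe_swapped(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      int m, int n, int k) {
  if (!use_swap_ab(m)) return gemm_nt(a, b, m, n, k);
  // Swapped: B is the left operand, so the GEMM sees an n x m problem
  // and produces D^T (n x m, row-major).
  std::vector<float> dt = gemm_nt(b, a, n, m, k);
  // Transpose back to m x n so callers see the same result either way.
  std::vector<float> d(static_cast<std::size_t>(m) * n);
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      d[i * n + j] = dt[j * m + i];
  return d;
}
```

Both paths return the same matrix; only the shape presented to the inner GEMM changes, which is the property the swap depends on.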
- swap_ab_ template parameter to Fp4GemmSm120 with transposed layout types
- sm120_fp4_config_swapab tile configuration (128×128×256)
- args_from_options to handle swapAB (swap problem shape, strides, data/SF pointers)
- runGemm to use GemmConfig template parameter
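The argument handling described for swapAB (swapping problem shape, strides, and data/SF pointers) might look roughly like the sketch below. It is not the PR's `args_from_options`: the struct `Fp4GemmArgs`, its fields, and `make_fp4_gemm_args` are hypothetical simplifications of the CUTLASS argument types, and the output strides, which would also need to describe the transposed result, are omitted for brevity.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical, simplified stand-in for the CUTLASS GEMM arguments.
struct Fp4GemmArgs {
  int64_t m, n, k;     // problem shape as seen by the kernel
  const void* a;       // packed FP4 data of the left operand
  const void* b;       // packed FP4 data of the right operand
  const void* sf_a;    // block scale factors for the left operand
  const void* sf_b;    // block scale factors for the right operand
  int64_t stride_a;    // leading stride of the left operand
  int64_t stride_b;    // leading stride of the right operand
};

// Build kernel arguments; with swap_ab set, every A-side field trades places
// with its B-side counterpart and M trades places with N. K is unchanged.
inline Fp4GemmArgs make_fp4_gemm_args(int64_t m, int64_t n, int64_t k,
                                      const void* a, const void* b,
                                      const void* sf_a, const void* sf_b,
                                      int64_t stride_a, int64_t stride_b,
                                      bool swap_ab) {
  Fp4GemmArgs args{m, n, k, a, b, sf_a, sf_b, stride_a, stride_b};
  if (swap_ab) {
    std::swap(args.m, args.n);
    std::swap(args.a, args.b);
    std::swap(args.sf_a, args.sf_b);
    std::swap(args.stride_a, args.stride_b);
  }
  return args;
}
```

Swapping the scale-factor pointers together with the data pointers matters here: each operand's block scales must stay paired with its data, or the dequantized values would be wrong.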