fix(attention): add SM120 block size configuration for extend attention #17908
magik6k wants to merge 1 commit into sgl-project:main
Conversation
Add SM120 (Blackwell RTX) specific block size configuration to fix shared memory exhaustion on consumer/workstation Blackwell GPUs (RTX 5090, RTX PRO 6000, etc.). SM120 has only ~100KB of shared memory per SM, compared to 228KB on SM100 (datacenter Blackwell such as B100/B200).

The existing code matched SM120 via the `CUDA_CAPABILITY[0] >= 9` check but used Hopper-sized blocks (128, 64) that require ~106KB, exceeding the hardware limit. This fix adds an explicit SM120 case with smaller block sizes (64, 64), (32, 64), or (32, 32) depending on head dimension, similar to sm86/sm89 Ampere, which also has ~100KB of shared memory.

Tested on 8x RTX PRO 6000 Blackwell with:

- Kimi K2-Thinking NVFP4: 5,816 tok/s peak throughput
- Kimi K2.5 INT4: 985 tok/s peak throughput

Fixes sgl-project#14322
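The dispatch bug is visible with plain integers: the pre-fix code only checked the capability major version, so SM120 (major 12) satisfied the Hopper-era `>= 9` test and received Hopper-sized blocks. A minimal sketch of the before/after ordering (function names are illustrative, not the kernel's actual code; only the `CUDA_CAPABILITY` comparisons and block-size tuples mirror the patch):

```python
# Illustrative sketch of the dispatch order fixed by this PR.
# Function names are hypothetical; the comparisons mirror the real code.

def pick_blocks_pre_fix(capability):
    major = capability[0]
    if major >= 9:                # SM90, SM100, *and* SM120 all land here
        return (128, 64)          # Hopper-sized blocks: ~106KB shared memory
    return (32, 32)               # older architectures (simplified)

def pick_blocks_post_fix(capability, lq=128):
    major = capability[0]
    if major == 12:               # SM120 handled before the >= 9 catch-all
        if lq <= 128:
            return (64, 64)
        elif lq <= 256:
            return (32, 64)
        return (32, 32)
    if major >= 9:
        return (128, 64)
    return (32, 32)

print(pick_blocks_pre_fix((12, 0)))   # (128, 64) -> OutOfResources on SM120
print(pick_blocks_post_fix((12, 0)))  # (64, 64) -> fits in ~100KB
```

Checking `== 12` before the generic `>= 9` branch is what keeps SM120 out of the Hopper path.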
Summary of Changes (Gemini Code Assist): This pull request introduces a fix for running attention kernels on NVIDIA SM120 Blackwell GPUs, such as the RTX PRO 6000. By adding specific configurations for SM120, which has less shared memory than its datacenter counterparts, the change prevents out-of-resource errors and enables models like Kimi K2 and Kimi K2.5 to run successfully, significantly improving compatibility and performance on these workstation-class GPUs.
Code Review
This pull request provides an important fix for running models on SM120 (consumer Blackwell) GPUs by adjusting block sizes in the extend attention kernel to prevent shared memory errors. The change is clear, well-documented, and includes test results, which is great. I've added one suggestion to refactor the conditional logic slightly to improve code structure and reduce duplication. Overall, this is a valuable contribution.
```python
elif _is_cuda and CUDA_CAPABILITY[0] == 12:
    # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
    # Consumer/workstation Blackwell has only ~100KB shared memory
    # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
    # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
    if Lq <= 128:
        BLOCK_M, BLOCK_N = (64, 64)
    elif Lq <= 256:
        BLOCK_M, BLOCK_N = (32, 64)
    else:
        BLOCK_M, BLOCK_N = (32, 32)
    num_warps = 4 if Lq <= 64 else 8
else:
    if _is_cuda and CUDA_CAPABILITY[0] >= 9:
```
This new elif block for SM120 introduces some code duplication. The num_warps assignment on line 77 is identical to the one on line 106 for other CUDA architectures. Additionally, the _is_cuda check is now present in multiple branches.
We can refactor this logic to be more streamlined and avoid duplication by nesting the CUDA-specific checks under a single else block. This improves maintainability by having a single point of assignment for num_warps for all CUDA architectures.
```diff
-elif _is_cuda and CUDA_CAPABILITY[0] == 12:
-    # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
-    # Consumer/workstation Blackwell has only ~100KB shared memory
-    # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
-    # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
-    if Lq <= 128:
-        BLOCK_M, BLOCK_N = (64, 64)
-    elif Lq <= 256:
-        BLOCK_M, BLOCK_N = (32, 64)
-    else:
-        BLOCK_M, BLOCK_N = (32, 32)
-    num_warps = 4 if Lq <= 64 else 8
-else:
-    if _is_cuda and CUDA_CAPABILITY[0] >= 9:
+else:
+    if _is_cuda and CUDA_CAPABILITY[0] == 12:
+        # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
+        # Consumer/workstation Blackwell has only ~100KB shared memory
+        # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
+        # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
+        if Lq <= 128:
+            BLOCK_M, BLOCK_N = (64, 64)
+        elif Lq <= 256:
+            BLOCK_M, BLOCK_N = (32, 64)
+        else:
+            BLOCK_M, BLOCK_N = (32, 32)
+    elif _is_cuda and CUDA_CAPABILITY[0] >= 9:
```
fixed by #14311
Personal note: I will not have much time to respond to comments here, but I wanted to contribute this fix as it did truly get the models going for me. I have no solid understanding of what the fix actually does, but it does work on 8x RTX 6000 Blackwell.
More detailed run artifacts in https://github.com/magik6k/glm-kimi-sm120-rtx6000bw
Thanks Opus / OpenCode / Exa..
SM120 (consumer/workstation Blackwell) has significantly less shared memory than SM100 (datacenter Blackwell):
The current code path for `CUDA_CAPABILITY[0] >= 9` uses Hopper-sized blocks (128, 64) requiring ~106KB shared memory, which exceeds SM120's limit:

```
triton.runtime.errors.OutOfResources: out of resource: shared memory,
Required: 106496, Hardware limit: 101376
```
This directly addresses Issue [Bug] Kimi k2 crashes sglang after first request on sm120 #14322 (Kimi K2 crashes on SM120).
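As a rough sanity check on the direction of the fix (an illustrative lower bound only: it counts just fp16 Q/K/V tiles, ignores the kernel's other shared-memory buffers, and so does not reproduce the exact 106496 figure from the error):

```python
# Lower-bound estimate of shared memory used by fp16 attention tiles.
# Counts only the Q, K, and V tiles; the real extend-attention kernel
# stages additional buffers, so actual usage is higher than this.

def tile_bytes(block_m, block_n, head_dim, dtype_bytes=2):
    q = block_m * head_dim * dtype_bytes        # Q tile
    kv = 2 * block_n * head_dim * dtype_bytes   # K and V tiles
    return q + kv

SM120_LIMIT = 101376  # hardware limit reported in the error above

# Halving BLOCK_M halves the Q tile, so the SM120 configs leave far
# more headroom under the ~100KB budget than the Hopper-sized (128, 64).
for bm, bn in [(128, 64), (64, 64), (32, 64), (32, 32)]:
    print((bm, bn), tile_bytes(bm, bn, head_dim=128))
```

Even this undercount shows why the smaller SM120 block sizes are the right lever: shared-memory demand scales linearly with both block dimensions.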
Changes

- Add an explicit SM120 case (`CUDA_CAPABILITY[0] == 12`) before the SM90+ check

Testing
Tested on 8x RTX PRO 6000 Blackwell Server Edition (SM120, 96GB each):
Related