
fix(attention): add SM120 block size configuration for extend attention #17908

Closed
magik6k wants to merge 1 commit into sgl-project:main from magik6k:sm120-extend-attention-fix

Conversation

@magik6k magik6k commented Jan 29, 2026

Personal note: I will not have much time to respond to comments here, but I wanted to contribute this fix since it truly got the models going for me. I have no solid understanding of what the fix actually does, but it does work on 8x RTX 6000 Blackwell.

More detailed run artifacts in https://github.com/magik6k/glm-kimi-sm120-rtx6000bw

Thanks Opus / OpenCode / Exa..

SM120 (consumer/workstation Blackwell) has significantly less shared memory than SM100 (datacenter Blackwell):

  • SM120: ~100KB shared memory (RTX 5090, RTX PRO 6000)
  • SM100: ~228KB shared memory (B100, B200)

The current code path for CUDA_CAPABILITY[0] >= 9 uses Hopper-sized blocks (128, 64) requiring ~106KB shared memory, which exceeds SM120's limit:

    triton.runtime.errors.OutOfResources: out of resource: shared memory,
    Required: 106496, Hardware limit: 101376

This directly addresses issue #14322 ([Bug] Kimi k2 crashes sglang after first request on sm120).
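The figures in that error line up with the "~106KB vs ~100KB" framing above; a minimal sanity check using only the byte counts quoted in the Triton error (no sglang internals assumed):

```python
# Figures copied from the Triton OutOfResources error above.
required = 106496  # bytes requested by the (128, 64) Hopper-sized blocks
limit = 101376     # per-SM shared memory available on SM120

print(required / 1024)   # 104.0 KiB requested (the "~106KB")
print(limit / 1024)      # 99.0 KiB available (the "~100KB" SM120 budget)
print(required > limit)  # True -> the kernel launch fails with OutOfResources
```

So any block configuration on SM120 has to stay under roughly 99 KiB of shared memory, which is what the smaller blocks below achieve.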

Changes

  • Add explicit SM120 detection (CUDA_CAPABILITY[0] == 12) before the SM90+ check
  • Use smaller block sizes (64,64), (32,64), (32,32) similar to sm86/sm89 Ampere which also has ~100KB shared memory
  • Update comment to clarify the existing path covers SM90 Hopper and SM100 datacenter Blackwell
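The dispatch logic described by the three bullets above can be sketched as a plain function. This is an illustrative sketch, not the actual sglang code: the function name and the pre-SM90 fallback values are hypothetical; the SM120 thresholds and block sizes are the ones listed in this PR.

```python
def pick_block_sizes(cuda_capability_major: int, Lq: int):
    """Sketch of the block-size dispatch this PR adds (illustrative names).

    The SM120 check must come BEFORE the >= 9 check, because 12 >= 9:
    without the explicit == 12 branch, SM120 falls into the Hopper path
    and gets (128, 64) blocks that exceed its shared memory.
    """
    if cuda_capability_major == 12:  # SM120 Blackwell RTX, ~100KB smem
        if Lq <= 128:
            block_m, block_n = 64, 64
        elif Lq <= 256:
            block_m, block_n = 32, 64
        else:
            block_m, block_n = 32, 32
        num_warps = 4 if Lq <= 64 else 8
    elif cuda_capability_major >= 9:  # SM90 Hopper / SM100 datacenter Blackwell
        block_m, block_n = 128, 64    # the Hopper-sized blocks (~106KB smem)
        num_warps = 4 if Lq <= 64 else 8
    else:
        block_m, block_n = 32, 32     # hypothetical conservative fallback
        num_warps = 4
    return block_m, block_n, num_warps
```

For example, `pick_block_sizes(12, 128)` selects the (64, 64) blocks, while capability 9 or 10 with the same `Lq` keeps the (128, 64) Hopper blocks.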

Testing

Tested on 8x RTX PRO 6000 Blackwell Server Edition (SM120, 96GB each):

| Model            | Quantization | Peak Throughput                |
|------------------|--------------|--------------------------------|
| Kimi K2-Thinking | NVFP4        | 5,816 tok/s @ 2048 concurrency (?? that's the total? Out was 1.6k iiuc) |
| Kimi K2.5        | INT4 (AWQ)   | 985 tok/s @ 1000 concurrency   |

Before this fix, both models crashed immediately after the first request with the shared memory error.

Commit message

Add SM120 (Blackwell RTX) specific block size configuration to fix shared
memory exhaustion on consumer/workstation Blackwell GPUs (RTX 5090, RTX PRO
6000, etc.).

SM120 has only ~100KB shared memory per SM, compared to 228KB on SM100
(datacenter Blackwell like B100/B200). The existing code matched SM120 via
the `CUDA_CAPABILITY[0] >= 9` check but used Hopper-sized blocks (128, 64)
that require ~106KB, exceeding the hardware limit.

This fix adds an explicit SM120 case with smaller block sizes (64,64),
(32,64), or (32,32) depending on head dimension, similar to sm86/sm89 Ampere
which also has ~100KB shared memory.

Tested on 8x RTX PRO 6000 Blackwell with:
- Kimi K2-Thinking NVFP4: 5,816 tok/s peak throughput
- Kimi K2.5 INT4: 985 tok/s peak throughput

Fixes sgl-project#14322
@gemini-code-assist
Contributor

Summary of Changes

Hello @magik6k, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical fix for running attention kernels on NVIDIA SM120 Blackwell GPUs, such as the RTX PRO 6000. By adding specific configurations for SM120, which has less shared memory than its datacenter counterparts, the change prevents out-of-resource errors and enables the successful execution of models like Kimi K2 and Kimi K2.5, significantly improving compatibility and performance on these workstation-class GPUs.

Highlights

  • SM120 Specific Configuration: Introduced a new code path to specifically handle NVIDIA SM120 Blackwell GPUs (e.g., RTX 5090, RTX PRO 6000) by checking for CUDA_CAPABILITY[0] == 12.
  • Optimized Block Sizes: Implemented smaller block sizes (64,64), (32,64), and (32,32) for SM120 based on query length (Lq) values, similar to Ampere (sm86/sm89) architectures, to accommodate its limited shared memory (~100KB).
  • Comment Clarification: Updated an existing comment to explicitly state that the CUDA_CAPABILITY[0] >= 9 path now covers both Hopper (SM90) and SM100 datacenter Blackwell GPUs.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides an important fix for running models on SM120 (consumer Blackwell) GPUs by adjusting block sizes in the extend attention kernel to prevent shared memory errors. The change is clear, well-documented, and includes test results, which is great. I've added one suggestion to refactor the conditional logic slightly to improve code structure and reduce duplication. Overall, this is a valuable contribution.

Comment on lines +66 to 79
    elif _is_cuda and CUDA_CAPABILITY[0] == 12:
        # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
        # Consumer/workstation Blackwell has only ~100KB shared memory
        # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
        # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
        if Lq <= 128:
            BLOCK_M, BLOCK_N = (64, 64)
        elif Lq <= 256:
            BLOCK_M, BLOCK_N = (32, 64)
        else:
            BLOCK_M, BLOCK_N = (32, 32)
        num_warps = 4 if Lq <= 64 else 8
    else:
        if _is_cuda and CUDA_CAPABILITY[0] >= 9:


Severity: medium

This new elif block for SM120 introduces some code duplication. The num_warps assignment on line 77 is identical to the one on line 106 for other CUDA architectures. Additionally, the _is_cuda check is now present in multiple branches.

We can refactor this logic to be more streamlined and avoid duplication by nesting the CUDA-specific checks under a single else block. This improves maintainability by having a single point of assignment for num_warps for all CUDA architectures.

Suggested change — remove:

    elif _is_cuda and CUDA_CAPABILITY[0] == 12:
        # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
        # Consumer/workstation Blackwell has only ~100KB shared memory
        # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
        # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
        if Lq <= 128:
            BLOCK_M, BLOCK_N = (64, 64)
        elif Lq <= 256:
            BLOCK_M, BLOCK_N = (32, 64)
        else:
            BLOCK_M, BLOCK_N = (32, 32)
        num_warps = 4 if Lq <= 64 else 8
    else:
        if _is_cuda and CUDA_CAPABILITY[0] >= 9:

and replace with:

    else:
        if _is_cuda and CUDA_CAPABILITY[0] == 12:
            # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
            # Consumer/workstation Blackwell has only ~100KB shared memory
            # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
            # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
            if Lq <= 128:
                BLOCK_M, BLOCK_N = (64, 64)
            elif Lq <= 256:
                BLOCK_M, BLOCK_N = (32, 64)
            else:
                BLOCK_M, BLOCK_N = (32, 32)
        elif _is_cuda and CUDA_CAPABILITY[0] >= 9:
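The reviewer's point — a single assignment site for num_warps shared by all CUDA branches — can be illustrated with a self-contained sketch. Names, the non-CUDA handling, and the pre-SM90 fallback are hypothetical; only the SM120 thresholds and block sizes come from this PR.

```python
def pick_block_sizes_refactored(is_cuda: bool, cuda_capability_major: int, Lq: int):
    """Sketch of the reviewer's suggested structure (illustrative, not sglang code):
    nest the architecture checks so num_warps is assigned exactly once."""
    if not is_cuda:
        raise NotImplementedError("non-CUDA paths omitted in this sketch")
    if cuda_capability_major == 12:
        # SM120 Blackwell RTX: ~100KB shared memory, so use smaller blocks
        if Lq <= 128:
            block_m, block_n = 64, 64
        elif Lq <= 256:
            block_m, block_n = 32, 64
        else:
            block_m, block_n = 32, 32
    elif cuda_capability_major >= 9:
        block_m, block_n = 128, 64  # SM90 Hopper / SM100 datacenter Blackwell
    else:
        block_m, block_n = 32, 32   # hypothetical pre-SM90 fallback
    num_warps = 4 if Lq <= 64 else 8  # single assignment for all CUDA paths
    return block_m, block_n, num_warps
```

Functionally this selects the same blocks as the PR's version for SM120 and SM90+; the difference is purely structural, trading one top-level elif for nesting so the warp-count logic is not duplicated.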

@magik6k
Author

magik6k commented Jan 31, 2026

fixed by #14311



Development

Successfully merging this pull request may close these issues.

[Bug] Kimi k2 crashes sglang after first request on sm120
