
fix(attention): add SM120 block size configuration for extend attention #17908

Closed
magik6k wants to merge 1 commit into sgl-project:main from magik6k:sm120-extend-attention-fix

Conversation

@magik6k magik6k commented Jan 29, 2026

Personal note: I will not have much time to respond to comments here, but I wanted to contribute this fix since it truly got the models going for me. I have no solid understanding of what the fix actually does, but it does work on 8x RTX 6000 Blackwell.

More detailed run artifacts in https://github.com/magik6k/glm-kimi-sm120-rtx6000bw

Thanks Opus / OpenCode / Exa..

SM120 (consumer/workstation Blackwell) has significantly less shared memory than SM100 (datacenter Blackwell):

  • SM120: ~100KB shared memory (RTX 5090, RTX PRO 6000)
  • SM100: ~228KB shared memory (B100, B200)

The current code path for CUDA_CAPABILITY[0] >= 9 uses Hopper-sized blocks (128, 64) requiring ~106KB shared memory, which exceeds SM120's limit:

    triton.runtime.errors.OutOfResources: out of resource: shared memory,
    Required: 106496, Hardware limit: 101376

This directly addresses issue #14322 ([Bug] Kimi k2 crashes sglang after first request on sm120).
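The figures in that error line up with the "~106KB vs ~100KB" framing above; a minimal sanity check using only the byte counts quoted in the Triton error (no sglang internals assumed):

```python
# Figures copied from the Triton OutOfResources error above.
required = 106496  # bytes requested by the (128, 64) Hopper-sized blocks
limit = 101376     # per-SM shared memory available on SM120

print(required / 1024)   # 104.0 KiB requested (the "~106KB")
print(limit / 1024)      # 99.0 KiB available (the "~100KB" SM120 budget)
print(required > limit)  # True -> the kernel launch fails with OutOfResources
```

So any block configuration on SM120 has to stay under roughly 99 KiB of shared memory, which is what the smaller blocks below achieve.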

Changes

  • Add explicit SM120 detection (CUDA_CAPABILITY[0] == 12) before the SM90+ check
  • Use smaller block sizes (64,64), (32,64), (32,32) similar to sm86/sm89 Ampere which also has ~100KB shared memory
  • Update comment to clarify the existing path covers SM90 Hopper and SM100 datacenter Blackwell
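The dispatch logic described by the three bullets above can be sketched as a plain function. This is an illustrative sketch, not the actual sglang code: the function name and the pre-SM90 fallback values are hypothetical; the SM120 thresholds and block sizes are the ones listed in this PR.

```python
def pick_block_sizes(cuda_capability_major: int, Lq: int):
    """Sketch of the block-size dispatch this PR adds (illustrative names).

    The SM120 check must come BEFORE the >= 9 check, because 12 >= 9:
    without the explicit == 12 branch, SM120 falls into the Hopper path
    and gets (128, 64) blocks that exceed its shared memory.
    """
    if cuda_capability_major == 12:  # SM120 Blackwell RTX, ~100KB smem
        if Lq <= 128:
            block_m, block_n = 64, 64
        elif Lq <= 256:
            block_m, block_n = 32, 64
        else:
            block_m, block_n = 32, 32
        num_warps = 4 if Lq <= 64 else 8
    elif cuda_capability_major >= 9:  # SM90 Hopper / SM100 datacenter Blackwell
        block_m, block_n = 128, 64    # the Hopper-sized blocks (~106KB smem)
        num_warps = 4 if Lq <= 64 else 8
    else:
        block_m, block_n = 32, 32     # hypothetical conservative fallback
        num_warps = 4
    return block_m, block_n, num_warps
```

For example, `pick_block_sizes(12, 128)` selects the (64, 64) blocks, while capability 9 or 10 with the same `Lq` keeps the (128, 64) Hopper blocks.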

Testing

Tested on 8x RTX PRO 6000 Blackwell Server Edition (SM120, 96GB each):

| Model            | Quantization | Peak Throughput                |
|------------------|--------------|--------------------------------|
| Kimi K2-Thinking | NVFP4        | 5,816 tok/s @ 2048 concurrency (?? that's the total? Out was 1.6k iiuc) |
| Kimi K2.5        | INT4 (AWQ)   | 985 tok/s @ 1000 concurrency   |

Before this fix, both models crashed immediately after the first request with the shared memory error.

Commit message

Add SM120 (Blackwell RTX) specific block size configuration to fix shared
memory exhaustion on consumer/workstation Blackwell GPUs (RTX 5090, RTX PRO
6000, etc.).

SM120 has only ~100KB shared memory per SM, compared to 228KB on SM100
(datacenter Blackwell like B100/B200). The existing code matched SM120 via
the `CUDA_CAPABILITY[0] >= 9` check but used Hopper-sized blocks (128, 64)
that require ~106KB, exceeding the hardware limit.

This fix adds an explicit SM120 case with smaller block sizes (64,64),
(32,64), or (32,32) depending on head dimension, similar to sm86/sm89 Ampere
which also has ~100KB shared memory.

Tested on 8x RTX PRO 6000 Blackwell with:
- Kimi K2-Thinking NVFP4: 5,816 tok/s peak throughput
- Kimi K2.5 INT4: 985 tok/s peak throughput

Fixes sgl-project#14322
@gemini-code-assist
Contributor

Summary of Changes

Hello @magik6k, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical fix for running attention kernels on NVIDIA SM120 Blackwell GPUs, such as the RTX PRO 6000. By adding specific configurations for SM120, which has less shared memory than its datacenter counterparts, the change prevents out-of-resource errors and enables the successful execution of models like Kimi K2 and Kimi K2.5, significantly improving compatibility and performance on these workstation-class GPUs.

Highlights

  • SM120 Specific Configuration: Introduced a new code path to specifically handle NVIDIA SM120 Blackwell GPUs (e.g., RTX 5090, RTX PRO 6000) by checking for CUDA_CAPABILITY[0] == 12.
  • Optimized Block Sizes: Implemented smaller block sizes (64,64), (32,64), and (32,32) for SM120 based on query length (Lq) values, similar to Ampere (sm86/sm89) architectures, to accommodate its limited shared memory (~100KB).
  • Comment Clarification: Updated an existing comment to explicitly state that the CUDA_CAPABILITY[0] >= 9 path now covers both Hopper (SM90) and SM100 datacenter Blackwell GPUs.




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides an important fix for running models on SM120 (consumer Blackwell) GPUs by adjusting block sizes in the extend attention kernel to prevent shared memory errors. The change is clear, well-documented, and includes test results, which is great. I've added one suggestion to refactor the conditional logic slightly to improve code structure and reduce duplication. Overall, this is a valuable contribution.

Comment on lines +66 to 79
    elif _is_cuda and CUDA_CAPABILITY[0] == 12:
        # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
        # Consumer/workstation Blackwell has only ~100KB shared memory
        # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
        # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
        if Lq <= 128:
            BLOCK_M, BLOCK_N = (64, 64)
        elif Lq <= 256:
            BLOCK_M, BLOCK_N = (32, 64)
        else:
            BLOCK_M, BLOCK_N = (32, 32)
        num_warps = 4 if Lq <= 64 else 8
    else:
        if _is_cuda and CUDA_CAPABILITY[0] >= 9:


Severity: medium

This new elif block for SM120 introduces some code duplication. The num_warps assignment on line 77 is identical to the one on line 106 for other CUDA architectures. Additionally, the _is_cuda check is now present in multiple branches.

We can refactor this logic to be more streamlined and avoid duplication by nesting the CUDA-specific checks under a single else block. This improves maintainability by having a single point of assignment for num_warps for all CUDA architectures.

Suggested change — remove:

    elif _is_cuda and CUDA_CAPABILITY[0] == 12:
        # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
        # Consumer/workstation Blackwell has only ~100KB shared memory
        # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
        # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
        if Lq <= 128:
            BLOCK_M, BLOCK_N = (64, 64)
        elif Lq <= 256:
            BLOCK_M, BLOCK_N = (32, 64)
        else:
            BLOCK_M, BLOCK_N = (32, 32)
        num_warps = 4 if Lq <= 64 else 8
    else:
        if _is_cuda and CUDA_CAPABILITY[0] >= 9:

and replace with:

    else:
        if _is_cuda and CUDA_CAPABILITY[0] == 12:
            # SM120 Blackwell RTX (RTX 5090, RTX PRO 6000, etc.)
            # Consumer/workstation Blackwell has only ~100KB shared memory
            # (vs 228KB on SM100 datacenter Blackwell like B100/B200)
            # Use smaller block sizes similar to sm86/sm89 Ampere which also has ~100KB
            if Lq <= 128:
                BLOCK_M, BLOCK_N = (64, 64)
            elif Lq <= 256:
                BLOCK_M, BLOCK_N = (32, 64)
            else:
                BLOCK_M, BLOCK_N = (32, 32)
        elif _is_cuda and CUDA_CAPABILITY[0] >= 9:
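The reviewer's point — a single assignment site for num_warps shared by all CUDA branches — can be illustrated with a self-contained sketch. Names, the non-CUDA handling, and the pre-SM90 fallback are hypothetical; only the SM120 thresholds and block sizes come from this PR.

```python
def pick_block_sizes_refactored(is_cuda: bool, cuda_capability_major: int, Lq: int):
    """Sketch of the reviewer's suggested structure (illustrative, not sglang code):
    nest the architecture checks so num_warps is assigned exactly once."""
    if not is_cuda:
        raise NotImplementedError("non-CUDA paths omitted in this sketch")
    if cuda_capability_major == 12:
        # SM120 Blackwell RTX: ~100KB shared memory, so use smaller blocks
        if Lq <= 128:
            block_m, block_n = 64, 64
        elif Lq <= 256:
            block_m, block_n = 32, 64
        else:
            block_m, block_n = 32, 32
    elif cuda_capability_major >= 9:
        block_m, block_n = 128, 64  # SM90 Hopper / SM100 datacenter Blackwell
    else:
        block_m, block_n = 32, 32   # hypothetical pre-SM90 fallback
    num_warps = 4 if Lq <= 64 else 8  # single assignment for all CUDA paths
    return block_m, block_n, num_warps
```

Functionally this selects the same blocks as the PR's version for SM120 and SM90+; the difference is purely structural, trading one top-level elif for nesting so the warp-count logic is not duplicated.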

@magik6k
Author

magik6k commented Jan 31, 2026

fixed by #14311



Development

Successfully merging this pull request may close these issues.

[Bug] Kimi k2 crashes sglang after first request on sm120
