Optimize GDN decode for Qwen3 Next #17094
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a critical optimization to the Gated Delta Rule (GDN) decoding kernel, specifically targeting the Qwen3 Next model. By adjusting a key parameter in the fused recurrent attention mechanism, it significantly improves overall inference performance, leading to higher output throughput and reduced latency and addressing a previously identified bottleneck.
Code Review
This pull request optimizes the Gated Delta Rule (GDN) kernel by increasing the block size BV from 8 to 32 in fused_recurrent_gated_delta_rule_fwd. This is good performance tuning that, as the benchmarks show, significantly improves throughput.
However, this optimization seems to be applied inconsistently across the codebase. I've found a few other places where BV is still capped at 8. Given the PR's goal is to optimize decoding, these should probably be updated as well to ensure consistent performance improvements across all relevant kernels.
Specifically, please consider applying the same change to:
- python/sglang/srt/layers/attention/fla/fused_recurrent.py at line 543 in fused_recurrent_gated_delta_rule_update_fwd.
- python/sglang/srt/layers/attention/fla/fused_sigmoid_gating_recurrent.py at line 184 in fused_sigmoid_gating_delta_rule_update.
- python/sglang/srt/layers/attention/fla/kda.py at line 56 in fused_recurrent_kda_fwd.
Applying this optimization consistently would likely yield further performance benefits, especially in decoding scenarios, which are the main focus of this PR.
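For illustration, here is a minimal sketch of the launch-grid pattern this change touches, assuming FLA-style grid sizing; the names and surrounding code are simplified, not the exact diff:

```python
import triton

def pick_grid(K: int, V: int, N: int, HV: int):
    """Sketch of how a fused recurrent kernel might pick its launch grid.

    BV is the block size along the value head dimension; the PR raises
    its cap from 8 to 32, so each program instance processes a larger
    value tile and the grid has fewer, heavier programs along that axis.
    """
    BK = triton.next_power_of_2(K)           # one tile covers the key dim
    BV = min(triton.next_power_of_2(V), 32)  # previously min(..., 8)
    NK = triton.cdiv(K, BK)
    NV = triton.cdiv(V, BV)
    # grid layout: (key tiles, value tiles, batch * value heads)
    return (NK, NV, N * HV)
```

If the other call sites listed above follow the same pattern, raising the cap there would presumably be the same one-line change.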
/rerun-stage unit-test-backend-4-gpu
✅ Triggered
/rerun-stage unit-test-backend-4-gpu
✅ Triggered
/tag-and-rerun-ci
/rerun-stage unit-test-backend-4-gpu
✅ Triggered
Motivation
As MoE/FullAttn/GEMM have been optimized, the GDN kernel has become a bottleneck for Qwen3 Next (Q3N). For example, decode linear attention takes about 15~23% of decode time at concurrency (cc) 64/256 with ISL=1k in NVFP4 inference on B200.
This PR applies an optimization to the GDN decode kernel, which brings significant end-to-end benefits.
Ref: vllm-project/vllm#31722
cc @yizhang2077
Modifications
Increase the cap on the value-dimension block size BV from 8 to 32 in fused_recurrent_gated_delta_rule_fwd (python/sglang/srt/layers/attention/fla/fused_recurrent.py).
Accuracy Tests and Benchmarking
TEP4, MTP OFF, B200
It gives about a 10% e2e improvement on the GSM8K dataset.
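For context, that figure is roughly consistent with the profiling numbers in the Motivation section. As a back-of-envelope Amdahl's-law check, where the kernel's share of decode time p is taken from the profile and the kernel speedup factor s is an assumption (not reported in the PR):

$$\text{e2e speedup} = \frac{1}{(1-p) + p/s} = \frac{1}{0.8 + 0.1} \approx 1.11 \quad (p = 0.2,\ s = 2)$$

i.e. about 11% end-to-end, in line with the observed ~10%.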
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci