Optimize GDN decode for Qwen3 Next#17094

Merged
Fridge003 merged 2 commits into sgl-project:main from samuellees:gdn-decode-optimize
Jan 30, 2026
Conversation

@samuellees
Contributor

@samuellees samuellees commented Jan 15, 2026

Motivation

As MoE, full attention, and GEMM have been optimized, the GDN kernel has become a bottleneck for Qwen3 Next (Q3N). For example, decode linear attention takes about 15~23% of decode time with cc=64/256 and ISL=1k in NVFP4 inference on B200.

This PR applies an optimization to the GDN decode kernel, which brings significant end-to-end benefits.

Ref: vllm-project/vllm#31722

cc @yizhang2077

Modifications

Accuracy Tests and Benchmarking

Setup: TEP4, MTP off, B200.
The change gives about a 10% end-to-end improvement on the gsm8k dataset.

# Before
Accuracy: 0.945
Invalid: 0.000
Latency: 74.870 s
Output throughput: 2967.035 token/s
# After
Accuracy: 0.943
Invalid: 0.000
Latency: 67.961 s
Output throughput: 3266.447 token/s
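As a quick sanity check on the numbers above, the latency reduction and the throughput gain should agree on the size of the end-to-end speedup:

```python
# Cross-check the reported benchmark numbers: latency speedup and throughput
# gain should both land near the claimed ~10% end-to-end improvement.
before_latency, after_latency = 74.870, 67.961    # seconds (reported above)
before_tput, after_tput = 2967.035, 3266.447      # token/s (reported above)

latency_speedup = before_latency / after_latency  # ~1.10x
throughput_gain = after_tput / before_tput        # ~1.10x

print(f"latency speedup: {latency_speedup:.3f}x")
print(f"throughput gain: {throughput_gain:.3f}x")
```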

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @samuellees, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical optimization to the Gated Delta Rule (GDN) decoding kernel, specifically targeting the Qwen3 Next model. By adjusting a key parameter in the fused recurrent attention mechanism, the PR significantly improves overall inference performance, leading to higher output throughput and reduced latency, addressing a previously identified bottleneck.

Highlights

  • GDN Kernel Optimization: The BV block size in the fused_recurrent_gated_delta_rule_fwd function has been increased from 8 to 32. This change is aimed at improving the performance of the Gated Delta Rule (GDN) kernel, which was identified as a bottleneck for Qwen3 Next decoding.
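To see why tiling the value dimension into blocks of BV is possible at all, here is a minimal NumPy reference of the gated-delta-rule decode step that the fused kernel computes. This is a sketch based on the DeltaNet literature; the shapes and names are illustrative, not the kernel's actual signature.

```python
import numpy as np

def gdn_decode_step(S, q, k, v, g, beta):
    """One gated-delta-rule decode step for one head (illustrative sketch).

    S: (K, V) recurrent state, q/k: (K,) query/key, v: (V,) value,
    g: log decay (scalar), beta: write strength (scalar).
    """
    S = np.exp(g) * S                        # gated decay of the state
    S = S + beta * np.outer(k, v - S.T @ k)  # delta-rule correction
    return S, S.T @ q                        # new state and output o: (V,)

# Each state column S[:, j] depends only on itself plus the shared k and q, so
# a kernel can tile the value dimension into independent blocks of size BV;
# this PR raises BV from 8 to 32 so each program instance covers more of V.
K, V = 16, 32
rng = np.random.default_rng(0)
S = np.zeros((K, V))
q, k, v = rng.normal(size=K), rng.normal(size=K), rng.normal(size=V)
S, o = gdn_decode_step(S, q, k, v, g=-0.1, beta=0.5)
print(o.shape)  # (32,)
```

Because the columns are independent, a larger BV trades a smaller launch grid for more work per program instance, which is where the decode speedup comes from.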


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request optimizes the Gated Delta Rule (GDN) kernel by increasing the block size BV from 8 to 32 in fused_recurrent_gated_delta_rule_fwd. This is a sound performance tuning that, as the benchmarks show, significantly improves throughput.

However, this optimization seems to be applied inconsistently across the codebase. I've found a few other places where BV is still capped at 8. Given the PR's goal is to optimize decoding, these should probably be updated as well to ensure consistent performance improvements across all relevant kernels.

Specifically, please consider applying the same change to:

  • python/sglang/srt/layers/attention/fla/fused_recurrent.py at line 543 in fused_recurrent_gated_delta_rule_update_fwd.
  • python/sglang/srt/layers/attention/fla/fused_sigmoid_gating_recurrent.py at line 184 in fused_sigmoid_gating_delta_rule_update.
  • python/sglang/srt/layers/attention/fla/kda.py at line 56 in fused_recurrent_kda_fwd.

Applying this optimization consistently would likely yield further performance benefits, especially in decoding scenarios, which seem to be the main focus of this PR.
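The change the reviewer suggests mirroring can be sketched as follows. This is a hypothetical illustration of raising a power-of-two cap on BV from 8 to 32; the helper and function names are stand-ins, not the repo's actual code.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (stand-in for triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()

def choose_bv_old(v_dim: int) -> int:
    return min(8, next_power_of_2(v_dim))   # before: BV capped at 8

def choose_bv_new(v_dim: int) -> int:
    return min(32, next_power_of_2(v_dim))  # after: BV capped at 32

# For a typical head value dim of 128, the cap is what binds in both cases.
print(choose_bv_old(128), choose_bv_new(128))  # 8 32
```

Small value dimensions are unaffected (the power-of-two bound binds instead of the cap), so the change is safe to replicate in the other kernels listed above.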

@samuellees samuellees changed the title Optimize GDN decoding for Qwen3 Next [NV] Optimize GDN decode for Qwen3 Next Jan 15, 2026
@samuellees samuellees changed the title [NV] Optimize GDN decode for Qwen3 Next Optimize GDN decode for Qwen3 Next Jan 15, 2026
@Fridge003
Collaborator

/rerun-stage unit-test-backend-4-gpu

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003
Collaborator

/rerun-stage unit-test-backend-4-gpu

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003
Collaborator

/tag-and-rerun-ci

@Fridge003
Collaborator

/rerun-stage unit-test-backend-4-gpu

@github-actions
Contributor

✅ Triggered unit-test-backend-4-gpu to run independently (skipping dependencies).

@github-actions
Contributor

🔗 View workflow run

@Fridge003 Fridge003 merged commit 81449b4 into sgl-project:main Jan 30, 2026
357 of 404 checks passed
yuki-brook pushed a commit to scitix/sglang that referenced this pull request Jan 31, 2026
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 2, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
@samuellees samuellees mentioned this pull request Feb 11, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
