
[Qwen3-Next] Optimize Prefill Kernel, add GDN Gluon kernel and optimize cumsum kernel#17983

Open
slowlyC wants to merge 10 commits intosgl-project:mainfrom
Jon-WZQ:gdn-optimize-prefill

Conversation

@slowlyC slowlyC commented Jan 30, 2026

Motivation

This PR optimizes the prefill kernel for Qwen3-Next (Gated Delta Rule) models, focusing on key improvements:

  • Blackwell GPU Performance: Add Gluon kernels that leverage the latest NVIDIA GPU features (compute capability >= 10.0) for significantly improved memory bandwidth and throughput.
  • Transposed Initial State Support: Enable transposed initial state layout [N, H, V, K] in addition to the default [N, H, K, V], which improves memory access patterns and reduces transpose overhead during prefill-decode transitions.
  • Cumsum Kernel Optimization: Add a vectorized cumsum kernel that processes multiple heads in a single kernel launch, reducing kernel launch overhead and improving GPU utilization.
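The vectorized cumsum idea can be illustrated with a plain-Python sketch (hypothetical, not the actual Triton kernel): all heads are handled in one pass, with the running sum resetting at each chunk boundary, instead of one launch per head.

```python
def chunk_local_cumsum_batched(g, chunk_size):
    """Sketch of a chunk-local cumsum over all heads at once.

    g: list of H per-head sequences, each of length T. Returns per-head
    running sums that reset at every chunk boundary, mirroring what a
    single vectorized kernel launch computes for all heads together.
    Names and shapes here are illustrative, not the kernel's actual API.
    """
    out = []
    for head in g:  # in the real kernel this loop maps to one grid dimension
        acc, row = 0.0, []
        for t, val in enumerate(head):
            if t % chunk_size == 0:  # new chunk: reset the running sum
                acc = 0.0
            acc += val
            row.append(acc)
        out.append(row)
    return out
```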

For decode, we use cutedsl to write the fuse_recurrent kernel for both decode and MTP; see "[Qwen3-Next] Add cutedsl decode/mtp kernel with transposed ssm_state and prefill gluon kernel for blackwell" (#17981).

Modifications

sglang/python/sglang/srt/layers/attention/fla:

  1. utils.py: Added IS_GLUON_SUPPORTED and FLA_CUMSUM_SCALAR_VECTORIZATION feature flags
  2. cumsum.py: Added chunk_local_cumsum_scalar_vectorization_kernel that processes BH heads simultaneously
  3. wy_fast.py/chunk_delta_h.py/chunk_o.py: Integrated Gluon kernel
  4. gluon/wy_fast_gluon.py: WY factorization kernel for computing w and u matrices
  5. gluon/chunk_delta_h_gluon.py: delta rule hidden state update kernel with efficient async memory operations
  6. gluon/chunk_o_gluon.py: chunk-wise output computation kernel
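The feature-flag gating described above might look like the following minimal sketch. The function name, signature, and exact version threshold are assumptions for illustration; the real flags live in fla/utils.py.

```python
import os

def gluon_supported(compute_capability, triton_version):
    # Hypothetical re-creation of an IS_GLUON_SUPPORTED-style gate:
    # per the PR text, Gluon kernels need Triton >= 3.6.0 and an NVIDIA
    # GPU with compute capability >= 10.0, and can be disabled by
    # setting FLA_USE_GLUON=0 in the environment.
    if os.environ.get("FLA_USE_GLUON", "1") == "0":
        return False
    major, _minor = compute_capability
    version = tuple(int(p) for p in triton_version.split(".")[:2])
    return major >= 10 and version >= (3, 6)
```

In the actual code the capability would come from `torch.cuda.get_device_capability()` rather than a parameter.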

Accuracy Tests

# python sglang/benchmark/gsm8k/bench_sglang.py                                         
Accuracy: 0.953
Invalid: 0.000
Latency: 6.152 s

Benchmarking and Profiling

On Blackwell, input:output = 32K:1 (profiler screenshots omitted):

  • cumsum kernel: 7 us → 3 us (~2.3x faster)
  • chunk_fwd_o gluon kernel: 133 us → 69 us (~1.9x faster)
  • wy_fast gluon kernel: 69 us → 50 us (~1.4x faster)

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

…t transpose initial_state 3.Optimize cumsum kernel

Co-authored-by: Jon-WZQ <wuziqiang.wzq@alibaba-inc.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @slowlyC, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the prefill kernel for Qwen3-Next models by integrating advanced GPU optimizations. It introduces specialized Gluon kernels for NVIDIA Blackwell/Hopper architectures, enabling more efficient memory operations. Additionally, it improves flexibility by supporting transposed initial state layouts and boosts throughput with a vectorized cumsum kernel, all aimed at reducing latency and improving overall GPU utilization.

Highlights

  • Gluon Kernel Integration: Introduced new Gluon kernels for chunk_delta_h, chunk_o, and wy_fast operations, leveraging NVIDIA Blackwell/Hopper GPU features (compute capability >= 10.0) for enhanced memory bandwidth and throughput.
  • Transposed Initial State Support: Added support for a transposed initial state layout [N, H, V, K] in the chunk_delta_h kernel, which can improve memory access patterns and reduce transpose overhead during prefill-decode transitions.
  • Vectorized Cumsum Kernel: Implemented a new vectorized cumsum kernel (chunk_local_cumsum_scalar_vectorization_kernel) that processes multiple heads simultaneously, reducing kernel launch overhead and improving GPU utilization.
  • Feature Flags for Optimization Control: Introduced IS_GLUON_SUPPORTED and FLA_CUMSUM_SCALAR_VECTORIZATION flags to conditionally enable these new optimizations based on detected hardware capabilities and environment variables.
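The transposed initial-state layout above swaps the last two axes of each (n, h) slice. A tiny Python sketch (illustrative only; the kernel operates on GPU tensors, not nested lists) shows the [K, V] → [V, K] conversion:

```python
def transpose_state(state):
    """Sketch: convert one (n, h) slice of the initial state from the
    default [K, V] layout to the transposed [V, K] layout, so a kernel
    can read V-major without a separate transpose pass. The helper name
    and list-of-lists representation are hypothetical."""
    K, V = len(state), len(state[0])
    return [[state[k][v] for k in range(K)] for v in range(V)]
```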



Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant optimizations for Qwen3-Next models by integrating Triton Gluon kernels for newer NVIDIA GPUs, adding support for transposed initial state layouts, and providing a vectorized cumsum kernel. The changes are well-structured and use feature flags to control the new functionalities. My review focuses on improving code maintainability by reducing duplication, fixing a potential bug in the kernel dispatch logic, and enhancing robustness. Overall, this is a valuable contribution that should improve performance.

Comment on lines +18 to +33
if IS_GLUON_SUPPORTED:
    try:
        from triton.experimental.gluon import language as gl
        from triton.experimental.gluon.nvidia.hopper import TensorDescriptor

        from sglang.srt.layers.attention.fla.gluon.chunk_delta_h_gluon import (
            chunk_gated_delta_rule_fwd_kernel_h_blockdim64_gluon,
        )
    except ImportError as e:
        raise ImportError(
            f">>> Failed to import Gluon in current triton version {triton.__version__} and "
            f">>> Platform {torch.cuda.get_device_capability()}.\n"
            f">>> Gluon/Blackwell features require: \n"
            f">>> 1. Triton >= 3.6.0\n"
            f">>> 2. NVIDIA GPU (compute capability >= 10.0)\n"
            f">>> Error: {e}\n"
            f">>> Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
Contributor


medium

The try...except block for importing Gluon and handling ImportError is duplicated across multiple files (e.g., chunk_o.py, wy_fast.py, and the new gluon/*.py files). To improve maintainability and reduce code repetition, consider centralizing this import and error-handling logic. A shared utility function in fla/utils.py could perform the import and return the necessary modules or raise a single, consistent error. This would make it easier to update the requirements or error messages in one place.
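A centralized helper could be sketched as follows (a hypothetical design for fla/utils.py; the function name `import_symbols` and its tuple-based signature are assumptions, not existing code):

```python
import importlib

def import_symbols(*specs):
    """Hypothetical shared helper that centralizes the Gluon import:
    resolve each (module_path, attribute_name) pair, raising one
    consistent ImportError if anything is missing, so callers no longer
    duplicate the try/except block."""
    resolved = []
    try:
        for module_path, name in specs:
            resolved.append(getattr(importlib.import_module(module_path), name))
    except (ImportError, AttributeError) as e:
        raise ImportError(
            "Gluon/Blackwell features require Triton >= 3.6.0 and an NVIDIA "
            "GPU with compute capability >= 10.0. "
            "Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
    return resolved
```

Callers would then do something like `gl, TensorDescriptor = import_symbols(("triton.experimental.gluon", "language"), ("triton.experimental.gluon.nvidia.hopper", "TensorDescriptor"))`, keeping the requirements and error message in one place.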

try:
    from triton.experimental.gluon import language as gl
    from triton.experimental.gluon.nvidia.hopper import TensorDescriptor
    from sglang.srt.layers.attention.fla.gluon.wy_fast_gluon import recompute_w_u_fwd_kernel_gluon
Collaborator


nit: could wy_fast_gluon have a better name?

Author


The name is chosen to align with wy_fast.py.

Collaborator

BBuf commented Feb 4, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 4, 2026
@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from 227a2b4 to a1f60da Compare February 5, 2026 01:20
Collaborator

BBuf commented Feb 6, 2026

/rerun-failed-ci

Author

slowlyC commented Feb 10, 2026

/rerun-failed-ci

@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from af5cdd8 to f23b095 Compare February 11, 2026 09:59
@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from 2ee3982 to a88a2ad Compare February 12, 2026 02:41
