
[Qwen3-Next] Optimize Prefill Kernel, add GDN Gluon kernel and optimize cumsum kernel#17983

Open
slowlyC wants to merge 10 commits intosgl-project:mainfrom
Jon-WZQ:gdn-optimize-prefill

Conversation

@slowlyC slowlyC commented Jan 30, 2026

Motivation

This PR optimizes the prefill kernel for Qwen3-Next (Gated Delta Rule) models, focusing on key improvements:

  • Blackwell GPU Performance: Add Gluon kernels that leverage the latest NVIDIA GPU features (compute capability >= 10.0) for significantly improved memory bandwidth and throughput.
  • Transposed Initial State Support: Enable transposed initial state layout [N, H, V, K] in addition to the default [N, H, K, V], which improves memory access patterns and reduces transpose overhead during prefill-decode transitions.
  • Cumsum Kernel Optimization: Add a vectorized cumsum kernel that processes multiple heads in a single kernel launch, reducing kernel launch overhead and improving GPU utilization.
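The vectorized cumsum idea can be illustrated with a plain-Python sketch (hypothetical, not the actual Triton kernel): all heads are handled in one pass, with the running sum resetting at each chunk boundary, instead of one launch per head.

```python
def chunk_local_cumsum_batched(g, chunk_size):
    """Sketch of a chunk-local cumsum over all heads at once.

    g: list of H per-head sequences, each of length T. Returns per-head
    running sums that reset at every chunk boundary, mirroring what a
    single vectorized kernel launch computes for all heads together.
    Names and shapes here are illustrative, not the kernel's actual API.
    """
    out = []
    for head in g:  # in the real kernel this loop maps to one grid dimension
        acc, row = 0.0, []
        for t, val in enumerate(head):
            if t % chunk_size == 0:  # new chunk: reset the running sum
                acc = 0.0
            acc += val
            row.append(acc)
        out.append(row)
    return out
```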

For decode, we use cutedsl to write the fuse_recurrent kernel for both decode and MTP; see "[Qwen3-Next] Add cutedsl decode/mtp kernel with transposed ssm_state and prefill gluon kernel for blackwell" (#17981).

Modifications

sglang/python/sglang/srt/layers/attention/fla:

  1. utils.py: Added IS_GLUON_SUPPORTED and FLA_CUMSUM_SCALAR_VECTORIZATION feature flags
  2. cumsum.py: Added chunk_local_cumsum_scalar_vectorization_kernel that processes BH heads simultaneously
  3. wy_fast.py/chunk_delta_h.py/chunk_o.py: Integrated Gluon kernel
  4. gluon/wy_fast_gluon.py: WY factorization kernel for computing w and u matrices
  5. gluon/chunk_delta_h_gluon.py: delta rule hidden state update kernel with efficient async memory operations
  6. gluon/chunk_o_gluon.py: chunk-wise output computation kernel
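The feature-flag gating described above might look like the following minimal sketch. The function name, signature, and exact version threshold are assumptions for illustration; the real flags live in fla/utils.py.

```python
import os

def gluon_supported(compute_capability, triton_version):
    # Hypothetical re-creation of an IS_GLUON_SUPPORTED-style gate:
    # per the PR text, Gluon kernels need Triton >= 3.6.0 and an NVIDIA
    # GPU with compute capability >= 10.0, and can be disabled by
    # setting FLA_USE_GLUON=0 in the environment.
    if os.environ.get("FLA_USE_GLUON", "1") == "0":
        return False
    major, _minor = compute_capability
    version = tuple(int(p) for p in triton_version.split(".")[:2])
    return major >= 10 and version >= (3, 6)
```

In the actual code the capability would come from `torch.cuda.get_device_capability()` rather than a parameter.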

Accuracy Tests

# python sglang/benchmark/gsm8k/bench_sglang.py                                         
Accuracy: 0.953
Invalid: 0.000
Latency: 6.152 s

Benchmarking and Profiling

On Blackwell, input:output = 32K:1 (profiler screenshots omitted):

  • cumsum kernel: 7 us → 3 us (~2.3x faster)
  • chunk_fwd_o gluon kernel: 133 us → 69 us (~1.9x faster)
  • wy_fast gluon kernel: 69 us → 50 us (~1.4x faster)

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

…t transpose initial_state 3.Optimize cumsum kernel

Co-authored-by: Jon-WZQ <wuziqiang.wzq@alibaba-inc.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @slowlyC, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of the prefill kernel for Qwen3-Next models by integrating advanced GPU optimizations. It introduces specialized Gluon kernels for NVIDIA Blackwell/Hopper architectures, enabling more efficient memory operations. Additionally, it improves flexibility by supporting transposed initial state layouts and boosts throughput with a vectorized cumsum kernel, all aimed at reducing latency and improving overall GPU utilization.

Highlights

  • Gluon Kernel Integration: Introduced new Gluon kernels for chunk_delta_h, chunk_o, and wy_fast operations, leveraging NVIDIA Blackwell/Hopper GPU features (compute capability >= 10.0) for enhanced memory bandwidth and throughput.
  • Transposed Initial State Support: Added support for a transposed initial state layout [N, H, V, K] in the chunk_delta_h kernel, which can improve memory access patterns and reduce transpose overhead during prefill-decode transitions.
  • Vectorized Cumsum Kernel: Implemented a new vectorized cumsum kernel (chunk_local_cumsum_scalar_vectorization_kernel) that processes multiple heads simultaneously, reducing kernel launch overhead and improving GPU utilization.
  • Feature Flags for Optimization Control: Introduced IS_GLUON_SUPPORTED and FLA_CUMSUM_SCALAR_VECTORIZATION flags to conditionally enable these new optimizations based on detected hardware capabilities and environment variables.
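The transposed initial-state layout above swaps the last two axes of each (n, h) slice. A tiny Python sketch (illustrative only; the kernel operates on GPU tensors, not nested lists) shows the [K, V] → [V, K] conversion:

```python
def transpose_state(state):
    """Sketch: convert one (n, h) slice of the initial state from the
    default [K, V] layout to the transposed [V, K] layout, so a kernel
    can read V-major without a separate transpose pass. The helper name
    and list-of-lists representation are hypothetical."""
    K, V = len(state), len(state[0])
    return [[state[k][v] for k in range(K)] for v in range(V)]
```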



Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant optimizations for Qwen3-Next models by integrating Triton Gluon kernels for newer NVIDIA GPUs, adding support for transposed initial state layouts, and providing a vectorized cumsum kernel. The changes are well-structured and use feature flags to control the new functionalities. My review focuses on improving code maintainability by reducing duplication, fixing a potential bug in the kernel dispatch logic, and enhancing robustness. Overall, this is a valuable contribution that should improve performance.

Comment on lines +18 to +33
if IS_GLUON_SUPPORTED:
    try:
        from triton.experimental.gluon import language as gl
        from triton.experimental.gluon.nvidia.hopper import TensorDescriptor

        from sglang.srt.layers.attention.fla.gluon.chunk_delta_h_gluon import (
            chunk_gated_delta_rule_fwd_kernel_h_blockdim64_gluon,
        )
    except ImportError as e:
        raise ImportError(
            f">>> Failed to import Gluon in current triton version {triton.__version__} and "
            f">>> Platform {torch.cuda.get_device_capability()}.\n"
            f">>> Gluon/Blackwell features require: \n"
            f">>> 1. Triton >= 3.6.0\n"
            f">>> 2. NVIDIA GPU (compute capability >= 10.0)\n"
            f">>> Error: {e}\n"
            f">>> Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
Contributor


medium

The try...except block for importing Gluon and handling ImportError is duplicated across multiple files (e.g., chunk_o.py, wy_fast.py, and the new gluon/*.py files). To improve maintainability and reduce code repetition, consider centralizing this import and error-handling logic. A shared utility function in fla/utils.py could perform the import and return the necessary modules or raise a single, consistent error. This would make it easier to update the requirements or error messages in one place.
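A centralized helper could be sketched as follows (a hypothetical design for fla/utils.py; the function name `import_symbols` and its tuple-based signature are assumptions, not existing code):

```python
import importlib

def import_symbols(*specs):
    """Hypothetical shared helper that centralizes the Gluon import:
    resolve each (module_path, attribute_name) pair, raising one
    consistent ImportError if anything is missing, so callers no longer
    duplicate the try/except block."""
    resolved = []
    try:
        for module_path, name in specs:
            resolved.append(getattr(importlib.import_module(module_path), name))
    except (ImportError, AttributeError) as e:
        raise ImportError(
            "Gluon/Blackwell features require Triton >= 3.6.0 and an NVIDIA "
            "GPU with compute capability >= 10.0. "
            "Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
    return resolved
```

Callers would then do something like `gl, TensorDescriptor = import_symbols(("triton.experimental.gluon", "language"), ("triton.experimental.gluon.nvidia.hopper", "TensorDescriptor"))`, keeping the requirements and error message in one place.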

try:
    from triton.experimental.gluon import language as gl
    from triton.experimental.gluon.nvidia.hopper import TensorDescriptor
    from sglang.srt.layers.attention.fla.gluon.wy_fast_gluon import recompute_w_u_fwd_kernel_gluon
Collaborator


nit: could wy_fast_gluon have a better name?

Author


The name is chosen to align with wy_fast.py.

Collaborator

BBuf commented Feb 4, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 4, 2026
@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from 227a2b4 to a1f60da Compare February 5, 2026 01:20
Collaborator

BBuf commented Feb 6, 2026

/rerun-failed-ci

Author

slowlyC commented Feb 10, 2026

/rerun-failed-ci

@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from af5cdd8 to f23b095 Compare February 11, 2026 09:59
@slowlyC slowlyC force-pushed the gdn-optimize-prefill branch from 2ee3982 to a88a2ad Compare February 12, 2026 02:41
