[Qwen3-Next] Optimize Prefill Kernel, add GDN Gluon kernel and optimize cumsum kernel#17983
slowlyC wants to merge 10 commits into sgl-project:main from
Conversation
…t transpose initial_state 3. Optimize cumsum kernel Co-authored-by: Jon-WZQ <wuziqiang.wzq@alibaba-inc.com>
Summary of Changes (Gemini Code Assist): This pull request enhances the performance of the prefill kernel for Qwen3-Next models by integrating advanced GPU optimizations. It introduces specialized Gluon kernels for NVIDIA Blackwell/Hopper architectures, enabling more efficient memory operations. It also improves flexibility by supporting transposed initial-state layouts and boosts throughput with a vectorized cumsum kernel, all aimed at reducing latency and improving overall GPU utilization.

Highlights
Code Review
This pull request introduces significant optimizations for Qwen3-Next models by integrating Triton Gluon kernels for newer NVIDIA GPUs, adding support for transposed initial state layouts, and providing a vectorized cumsum kernel. The changes are well-structured and use feature flags to control the new functionalities. My review focuses on improving code maintainability by reducing duplication, fixing a potential bug in the kernel dispatch logic, and enhancing robustness. Overall, this is a valuable contribution that should improve performance.
```python
if IS_GLUON_SUPPORTED:
    try:
        from triton.experimental.gluon import language as gl
        from triton.experimental.gluon.nvidia.hopper import TensorDescriptor
        from sglang.srt.layers.attention.fla.gluon.chunk_delta_h_gluon import (
            chunk_gated_delta_rule_fwd_kernel_h_blockdim64_gluon,
        )
    except ImportError as e:
        raise ImportError(
            f">>> Failed to import Gluon in current triton version {triton.__version__} and "
            f">>> Platform {torch.cuda.get_device_capability()}.\n"
            f">>> Gluon/Blackwell features require: \n"
            f">>> 1. Triton >= 3.6.0\n"
            f">>> 2. NVIDIA GPU (compute capability >= 10.0)\n"
            f">>> Error: {e}\n"
            f">>> Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
```
The try...except block for importing Gluon and handling ImportError is duplicated across multiple files (e.g., chunk_o.py, wy_fast.py, and the new gluon/*.py files). To improve maintainability and reduce code repetition, consider centralizing this import and error-handling logic. A shared utility function in fla/utils.py could perform the import and return the necessary modules or raise a single, consistent error. This would make it easier to update the requirements or error messages in one place.
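One possible shape for such a shared helper, sketched below. This is a minimal illustration, not the PR's code: the function name `import_gluon_modules`, its return convention, and the exact error wording are assumptions.

```python
import importlib
import os


def import_gluon_modules():
    """Hypothetical shared helper for fla/utils.py: import the Triton Gluon
    experimental modules once, with one consistent error message.

    Returns (gl, TensorDescriptor), or None when FLA_USE_GLUON=0 (callers
    would then fall back to the plain Triton kernels).
    """
    if os.environ.get("FLA_USE_GLUON", "1") == "0":
        return None
    try:
        gl = importlib.import_module("triton.experimental.gluon.language")
        hopper = importlib.import_module("triton.experimental.gluon.nvidia.hopper")
        return gl, hopper.TensorDescriptor
    except ImportError as e:
        raise ImportError(
            "Failed to import Gluon. Gluon/Blackwell features require "
            "Triton >= 3.6.0 and an NVIDIA GPU (compute capability >= 10.0). "
            "Set FLA_USE_GLUON=0 to disable and continue."
        ) from e
```

Each call site would then do `mods = import_gluon_modules()` instead of repeating the full `try...except` block, so requirement bumps or message changes happen in one place.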
python/sglang/srt/layers/attention/fla/gluon/chunk_delta_h_gluon.py (outdated review thread; resolved)
```python
try:
    from triton.experimental.gluon import language as gl
    from triton.experimental.gluon.nvidia.hopper import TensorDescriptor
    from sglang.srt.layers.attention.fla.gluon.wy_fast_gluon import recompute_w_u_fwd_kernel_gluon
```
nit: Could wy_fast_gluon have a better name?
/tag-and-rerun-ci
Force-pushed from 227a2b4 to a1f60da
/rerun-failed-ci
…spose, avoid convert_layout. 2. prefetch w before main loop and next iter
/rerun-failed-ci
Force-pushed from af5cdd8 to f23b095, then from 2ee3982 to a88a2ad
Motivation
This PR optimizes the prefill kernel for Qwen3-Next (Gated Delta Rule) models, focusing on key improvements:
For decode, we use CuTe DSL to write the fuse_recurrent kernel for both decode and MTP; see [Qwen3-Next] Add cutedsl decode/mtp kernel with transposed ssm_state and prefill gluon kernel for blackwell #17981.
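Supporting a transposed initial-state layout means a kernel can consume the state in either orientation by adjusting its index math instead of materializing a transpose (or a convert_layout) first. A NumPy sketch of the idea, with a hypothetical `apply_state` standing in for the kernel's state read:

```python
import numpy as np


def apply_state(q, state, transposed=False):
    """Compute o[v] = sum_k q[k] * S[k, v].

    `state` may be stored as [K, V], or as [V, K] when `transposed` is True;
    the [V, K] copy is read directly, with no transpose materialized.
    """
    return state @ q if transposed else state.T @ q


q = np.array([1.0, 2.0])                          # [K]
S = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # [K, V]
# Both layouts yield the same output vector.
assert np.allclose(apply_state(q, S), apply_state(q, S.T, transposed=True))
```

On GPU the same trick avoids an extra layout-conversion pass over the state tensor, which is the point of accepting the transposed layout directly.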
Modifications
sglang/python/sglang/srt/layers/attention/fla:
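The cumsum used by the chunked GDN kernels is chunk-local (it restarts at each chunk boundary). A reference NumPy sketch of the computation the vectorized Triton kernel performs; illustrative only, and the function name is assumed:

```python
import numpy as np


def chunk_local_cumsum_ref(g, chunk_size):
    """Per-chunk inclusive cumulative sum along the sequence axis.

    Reference for what the optimized cumsum kernel computes; the real
    kernel vectorizes this across GPU threads.
    """
    T = g.shape[0]
    assert T % chunk_size == 0, "sequence length must be a multiple of chunk_size"
    chunks = g.reshape(T // chunk_size, chunk_size, *g.shape[1:])
    return chunks.cumsum(axis=1).reshape(g.shape)


g = np.arange(8.0)
print(chunk_local_cumsum_ref(g, 4))  # cumsum restarts at each chunk boundary
```

Because each chunk's cumsum is independent, all chunks can be processed in parallel, which is what makes the kernel amenable to vectorization.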
Accuracy Tests
Benchmarking and Profiling
On Blackwell, input:output = 32K:1
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci