[Feature][JIT Kernel] Fused TP QK norm For Minimax by DarkSharpness · Pull Request #20673 · sgl-project/sglang

DarkSharpness · 2026-03-16T08:18:31Z

Motivation

Adapted from trt-llm kernels. Special thanks to @jmydurant. We mainly optimize the memory access and reuse the custom all reduce v2 in SGLang.

Should be merged after #19880

Modifications

Accuracy Tests

Benchmarking and Profiling

python -m sglang.launch_server \                                                                          
    --model-path MiniMaxAI/MiniMax-M2.5 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code

python -m sglang.test.send_one

Decode performance:

Before: 150 tps; After: 157 tps (both already using JIT custom all reduce)

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-16T08:19:03Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance optimization for tensor-parallel operations by implementing a JIT-compiled fused QK norm kernel. This new kernel, adapted from TensorRT-LLM, aims to enhance memory access patterns and leverage the custom all-reduce v2 for efficient distributed computation. The changes include the addition of new CUDA kernels, Python bindings, and dedicated benchmarks and tests, culminating in its integration into the MiniMaxM2 model architecture.

Highlights

Fused QK Norm Kernel: Introduced a new JIT-compiled kernel for fused tensor-parallel Query-Key (QK) normalization, adapted from TensorRT-LLM, to optimize memory access and computation.
JIT Custom All-Reduce Integration: Implemented a new Python module (all_reduce.py) to expose the JIT custom all-reduce and fused QK norm functionalities, along with their C++ CUDA kernel definitions.
Benchmarking and Testing: Added comprehensive benchmarking and correctness test suites for both the JIT custom all-reduce and the new fused QK norm to validate performance and accuracy.
Refactored All-Reduce Logic: Refactored the existing custom all-reduce (v1) to utilize a new utility function for NVLink and P2P capability checks, and introduced an opt-in mechanism for the JIT-compiled v2 implementation.
Model Integration: Integrated the fused QK norm into the MiniMaxM2 model, replacing the naive implementation for improved performance in tensor-parallel setups.
Infrastructure Updates: Updated .clang-format rules and introduced new C++ CUDA headers for distributed communication primitives and FFI tensor utilities to support the new kernels.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

python/sglang/jit_kernel/.clang-format
- Updated include category regex to be more general.
python/sglang/jit_kernel/all_reduce.py
- Added Python module for JIT custom all-reduce and fused QK norm functionalities.
python/sglang/jit_kernel/benchmark/bench_custom_all_reduce.py
- Added benchmark script for JIT custom all-reduce (v2) against NCCL and AOT custom all-reduce (v1).
python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py
- Added benchmark script for the new fused parallel QK norm.
python/sglang/jit_kernel/csrc/distributed/custom_all_reduce_base.cuh
- Added C++ header for the custom all-reduce base class and FFI registration.
python/sglang/jit_kernel/csrc/distributed/custom_all_reduce_pull.cuh
- Added C++ CUDA kernel for pull-based custom all-reduce.
python/sglang/jit_kernel/csrc/distributed/custom_all_reduce_push.cuh
- Added C++ CUDA kernel for push-based custom all-reduce.
python/sglang/jit_kernel/csrc/distributed/tp_qknorm.cuh
- Added C++ CUDA kernel for the fused parallel QK norm across heads.
python/sglang/jit_kernel/csrc/gemm/marlin/marlin.cuh
- Removed an unused host::div_ceil import.
python/sglang/jit_kernel/include/sgl_kernel/distributed/common.cuh
- Added C++ CUDA header defining distributed communication primitives like Semaphores and controllers.
python/sglang/jit_kernel/include/sgl_kernel/distributed/custom_all_reduce.cuh
- Added C++ CUDA header for the custom all-reduce base class, IPC handle utilities, and reduction implementation.
python/sglang/jit_kernel/include/sgl_kernel/ffi.h
- Added C++ header providing FFI tensor creation and manipulation utilities.
python/sglang/jit_kernel/include/sgl_kernel/utils.cuh
- Added a div_ceil utility function for integer division.
python/sglang/jit_kernel/include/sgl_kernel/vec.cuh
- Updated AlignedVector load and store methods to accept void* pointers for greater flexibility.
python/sglang/jit_kernel/include/sgl_kernel/warp.cuh
- Updated the reduce_sum template to allow specifying the number of threads for warp-level reduction.
python/sglang/jit_kernel/tests/test_custom_all_reduce.py
- Added correctness tests for the JIT custom all-reduce (v2) kernel.
python/sglang/jit_kernel/tests/test_tp_qknorm.py
- Added correctness tests for the fused parallel QK norm.
python/sglang/srt/distributed/device_communicators/custom_all_reduce.py
- Refactored custom all-reduce (v1) initialization and added an opt-in flag for the JIT-compiled v2.
python/sglang/srt/distributed/device_communicators/custom_all_reduce_utils.py
- Moved and consolidated NVLink and P2P capability check logic into a new utility function.
python/sglang/srt/distributed/device_communicators/custom_all_reduce_v2.py
- Added new Python module implementing the JIT-compiled custom all-reduce v2.
python/sglang/srt/models/minimax_m2.py
- Integrated the new fused QK norm into the MiniMaxM2 model's attention mechanism, replacing the previous naive implementation.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

DarkSharpness · 2026-03-16T08:24:36Z

Performance result (q_dim = 6144, k_dim = 1024, TP=4):

H200

q_dim	k_dim	batch	fused_us	baseline_us
6144	1024	1	2.6	5.1
6144	1024	2	2.7	5.1
6144	1024	4	2.7	5.1
6144	1024	8	2.8	5.2
6144	1024	16	2.9	5.2
6144	1024	32	2.9	5.3
6144	1024	64	2.9	5.4
6144	1024	128	2.9	5.6
6144	1024	256	3.1	5.9
6144	1024	512	3.5	6.6
6144	1024	1024	4.3	7.6
6144	1024	2048	8.0	11.2
6144	1024	4096	15.0	16.0
6144	1024	8192	28.4	28.5
6144	1024	16384	52.0	52.7

B200

q_dim	k_dim	batch	fused_us	baseline_us
6144	1024	1	4.1	6.2
6144	1024	2	4.3	6.4
6144	1024	4	4.4	6.2
6144	1024	8	4.4	6.5
6144	1024	16	4.4	6.5
6144	1024	32	4.5	6.6
6144	1024	64	4.5	6.9
6144	1024	128	4.6	6.8
6144	1024	256	4.7	6.8
6144	1024	512	5.0	7.5
6144	1024	1024	6.4	8.5
6144	1024	2048	8.5	10.1
6144	1024	4096	14.5	14.6
6144	1024	8192	23.8	20.6
6144	1024	16384	42.0	35.1

gemini-code-assist

Code Review

This pull request introduces a significant performance optimization by adding a JIT-compiled fused kernel for tensor-parallel QK normalization, adapted from TensorRT-LLM. It also brings in a new, more flexible JIT-based custom all-reduce framework (v2) that the fused kernel leverages. The changes are extensive, including new C++ CUDA kernels, Python wrappers, comprehensive benchmarks, and correctness tests. The refactoring of the existing custom all-reduce infrastructure to support this new implementation is also well-done. I've found one critical issue regarding the mathematical correctness of the RMSNorm calculation within the new fused kernel, which I've detailed in a specific comment. Once that is addressed, this will be an excellent contribution.

DarkSharpness · 2026-03-20T10:39:23Z

/tag-and-rerun-ci

BBuf · 2026-03-26T08:24:48Z

+        self._world_size = get_tensor_model_parallel_world_size()
+        self._eps = q_norm.variance_epsilon
+        self._cpu_group = get_tp_group().cpu_group
+        use_fused_norm = get_bool_env_var("SGLANG_USE_FUSED_PARALLEL_QKNORM")


Should we add this environment variable to doc?

Can this be a server arg instead of an env var?

DarkSharpness · 2026-04-03T15:34:29Z

/rerun-failed-ci try again try again

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

BBuf · 2026-04-13T12:29:37Z

Merged with all ci green. https://github.com/sgl-project/sglang/actions/runs/24117981354/job/70978742116?pr=20673

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

nvpohanh · 2026-04-14T09:22:44Z

cc @trevor-m

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

github-actions Bot added the jit-kernel label Mar 16, 2026

gemini-code-assist Bot reviewed Mar 16, 2026

View reviewed changes

Comment thread python/sglang/jit_kernel/csrc/distributed/tp_qknorm.cuh

Comment thread python/sglang/jit_kernel/csrc/distributed/tp_qknorm.cuh

DarkSharpness changed the title ~~[Feature][JIT Kernel] Fused TP QK norm~~ [Feature][JIT Kernel] Fused TP QK norm For Minimax Mar 18, 2026

DarkSharpness marked this pull request as ready for review March 20, 2026 10:27

DarkSharpness requested review from BBuf, HydraQYH, celve, ch-wan, merrymercy, yizhang2077 and yuan-luo as code owners March 20, 2026 10:27

DarkSharpness closed this Mar 20, 2026

DarkSharpness force-pushed the misc_qknorm_ar branch from 5b1da52 to f418327 Compare March 20, 2026 10:32

DarkSharpness reopened this Mar 20, 2026

github-actions Bot added the run-ci label Mar 20, 2026

DarkSharpness force-pushed the misc_qknorm_ar branch 2 times, most recently from 7262bd2 to 7aca6f0 Compare March 26, 2026 04:59

BBuf reviewed Mar 26, 2026

View reviewed changes

Comment thread python/sglang/jit_kernel/benchmark/bench_tp_qknorm.py Outdated

BBuf reviewed Mar 26, 2026

View reviewed changes

Comment thread python/sglang/jit_kernel/csrc/distributed/tp_qknorm.cuh

BBuf reviewed Mar 26, 2026

View reviewed changes

Comment thread python/sglang/srt/models/minimax_m2.py Outdated

BBuf approved these changes Mar 27, 2026

View reviewed changes

DarkSharpness force-pushed the misc_qknorm_ar branch from 74ca882 to 7a21611 Compare April 2, 2026 00:48

DarkSharpness added the high priority label Apr 2, 2026

DarkSharpness force-pushed the misc_qknorm_ar branch 2 times, most recently from f1baa10 to a6d0557 Compare April 7, 2026 04:48

DarkSharpness and others added 15 commits April 8, 2026 12:37

feat: support all reduce + qk norm

4b4e6eb

feat: improve performance and add benchmark

897126f

feat: optimize occupancy and reduce waves

c4fd303

minor: add credit; fix benchmark

7b0ceeb

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

feat: prefetch to improve performance

594c45b

minor: register ci

b8f0415

minor: minor add benchmark ci

d26c393

minor: fix test

889b04f

fix: fix wrong size calculation

3fd6366

misc: fxxk ci

6de363b

minor: fix lint

93b3fb6

misc: reduce push signal size

861081b

fix: fix after rebase

d1efab1

feat: integrate into m2 model

b961696

feat: update benchmark

6355c17

DarkSharpness force-pushed the misc_qknorm_ar branch from 3c08b19 to d1efab1 Compare April 8, 2026 04:37

BBuf merged commit 314d6ec into sgl-project:main Apr 13, 2026
284 of 318 checks passed

michaelzhang-ai mentioned this pull request Apr 13, 2026

[AMD] Add MiniMax-M2.7 accuracy and performance nightly tests #22722

Merged

pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026

[Feature][JIT Kernel] Fused TP QK norm For Minimax (sgl-project#20673)

97026a3

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

DarkSharpness deleted the misc_qknorm_ar branch April 14, 2026 11:52

QingGeng36 mentioned this pull request Apr 20, 2026

[Bug] Minimax 2.5 does not support TP16 in PD-disaggregated scenarios #23260

Open

5 tasks

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[Feature][JIT Kernel] Fused TP QK norm For Minimax (sgl-project#20673)

71faaf3

Co-authored-by: Mingyang Jiang <13463932+jmydurant@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature][JIT Kernel] Fused TP QK norm For Minimax#20673

[Feature][JIT Kernel] Fused TP QK norm For Minimax#20673
BBuf merged 15 commits into
sgl-project:mainfrom
DarkSharpness:misc_qknorm_ar

DarkSharpness commented Mar 16, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot commented Mar 16, 2026

Uh oh!

DarkSharpness commented Mar 16, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

DarkSharpness commented Mar 20, 2026

Uh oh!

Uh oh!

BBuf Mar 26, 2026

Uh oh!

trevor-m Apr 10, 2026

Uh oh!

Uh oh!

Uh oh!

DarkSharpness commented Apr 3, 2026 •

edited

Loading

Uh oh!

BBuf commented Apr 13, 2026

Uh oh!

Uh oh!

nvpohanh commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

DarkSharpness commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

Uh oh!

gemini-code-assist Bot commented Mar 16, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

DarkSharpness commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

DarkSharpness commented Mar 20, 2026

Uh oh!

Uh oh!

BBuf Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

trevor-m Apr 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

DarkSharpness commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BBuf commented Apr 13, 2026

Uh oh!

Uh oh!

nvpohanh commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DarkSharpness commented Mar 16, 2026 •

edited

Loading

DarkSharpness commented Mar 16, 2026 •

edited

Loading

DarkSharpness commented Apr 3, 2026 •

edited

Loading