[Feature] Support JIT set kv cache by DarkSharpness · Pull Request #16273 · sgl-project/sglang

DarkSharpness · 2026-01-01T17:53:38Z

Motivation

Currently, SGLang stores keys and values into the KV cache with PyTorch's native API, which is highly inefficient. Even when the two stores are overlapped on 2 streams, performance is still poor.

Modifications

This PR is a superset of #9775, the AOT kernel in SGLang. We introduce many aggressive optimizations to minimize latency, especially for cases where num_kv_head * head_dim is large (e.g. 1024 for Llama 3.1 8B on 1 GPU).

This PR also fixes some minor errors in qknorm and moves norm.cuh to elementwise/qknorm.cuh.
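
For readers unfamiliar with the operation, the store step can be modeled as a row scatter: each incoming token's key row and value row are written into the cache at the slot given by an index array. The following is a framework-free sketch of those semantics only, not the kernel from this PR. The names `store_kv_cache`, `k_cache`, `v_cache`, and `loc` are illustrative assumptions, and plain Python lists stand in for GPU tensors.

```python
from typing import List, Sequence

Row = List[float]


def store_kv_cache(
    k_cache: List[Row],
    v_cache: List[Row],
    loc: Sequence[int],
    k: Sequence[Row],
    v: Sequence[Row],
) -> None:
    """Scatter rows of k and v into the caches at the slots listed in loc.

    Hypothetical reference model: each row holds num_kv_head * head_dim elements.
    """
    if not (len(loc) == len(k) == len(v)):
        raise ValueError("loc, k and v must have the same number of rows")
    for slot, k_row, v_row in zip(loc, k, v):
        if len(k_row) != len(k_cache[slot]) or len(v_row) != len(v_cache[slot]):
            raise ValueError("row width must match the cache row width")
        # Copy so that the cache does not alias the caller's buffers.
        k_cache[slot] = list(k_row)
        v_cache[slot] = list(v_row)


# Four cache slots, row width 2; two new tokens go to slots 3 and 1.
k_cache = [[0.0, 0.0] for _ in range(4)]
v_cache = [[0.0, 0.0] for _ in range(4)]
store_kv_cache(
    k_cache,
    v_cache,
    [3, 1],
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
)
```

With PyTorch tensors, the same semantics can be written as indexed assignment such as `k_cache[loc] = k`; whether that is the exact call SGLang used before this PR is not stated here.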

Accuracy Tests

Benchmarking and Profiling

Latency (μs) on B200. "PyTorch 2 Stream" is the current SGLang implementation.

item_size	batch_size	SGL AOT Kernel	SGL JIT Kernel	PyTorch Compile	PyTorch 2 Stream
64	1	1.475760	1.009964	2.879933	2.137862
64	2	1.479312	1.008584	1.332617	2.165983
64	4	1.495821	1.015186	1.366358	2.651818
64	8	1.526468	1.024717	1.368124	3.605899
64	16	1.546549	1.030255	1.373649	3.568185
64	32	1.551000	1.029608	1.369916	3.574067
64	64	1.550317	1.033704	1.381571	3.578550
64	128	1.549651	1.039844	1.400229	3.606233
64	256	1.556941	1.057436	1.428082	3.641356
64	512	1.574894	1.083511	1.458229	3.716218
64	1024	1.603472	1.166626	1.525649	3.891339
64	2048	1.654045	1.438690	1.632083	3.951207
64	4096	1.911969	2.006328	2.048600	4.124578
64	8192	2.149733	3.094516	3.121697	6.165115
64	16384	3.372198	5.360094	5.390531	10.150578
128	1	1.453023	1.023631	2.690649	2.086427
128	2	1.502191	1.023050	2.881370	2.623525
128	4	1.509589	1.025213	2.970286	3.469111
128	8	1.549430	1.032561	2.958842	3.456083
128	16	1.560619	1.035725	2.966853	3.461197
128	32	1.565613	1.037580	2.979216	3.461016
128	64	1.562321	1.036385	3.018421	3.487620
128	128	1.564134	1.046548	3.093920	3.529567
128	256	1.569323	1.055802	3.158427	3.568132
128	512	1.587742	1.085573	3.297440	3.737223
128	1024	1.623137	1.165268	3.536560	3.838918
128	2048	1.721122	1.441099	4.393200	4.023174
128	4096	1.941110	2.025127	6.513865	5.991636
128	8192	2.717290	3.156197	11.109573	9.710951
128	16384	4.957720	5.841623	20.513093	19.146407
256	1	1.493269	1.031297	2.813659	2.569373
256	2	1.503021	1.033991	2.881263	3.465180
256	4	1.523203	1.029978	2.963893	3.453150
256	8	1.613259	1.036595	2.957579	3.465483
256	16	1.629727	1.044609	2.989868	3.466683
256	32	1.632352	1.048610	3.015526	3.491033
256	64	1.632348	1.043375	3.083132	3.517918
256	128	1.627357	1.049786	3.158427	3.562083
256	256	1.635622	1.070493	3.317974	3.724180
256	512	1.660629	1.096066	3.529316	3.823884
256	1024	1.698602	1.176087	4.392987	4.031836
256	2048	1.908339	1.475349	6.517553	6.001262
256	4096	2.393894	2.073650	11.060907	9.847213
256	8192	4.742189	4.138124	20.497467	19.360554
256	16384	7.464768	6.823438	37.542733	36.374718
512	1	1.640000	1.032392	2.840718	3.352254
512	2	1.646201	1.029739	2.942000	3.450450
512	4	1.673959	1.036462	2.940267	3.399091
512	8	1.812954	1.040000	2.982378	3.458252
512	16	1.812832	1.043403	3.011333	3.476167
512	32	1.820072	1.040862	3.083919	3.511279
512	64	1.821294	1.045258	3.157684	3.558164
512	128	1.828406	1.065219	3.325315	3.711719
512	256	1.843185	1.090376	3.535657	3.822081
512	512	1.863613	1.173345	4.393135	4.019737
512	1024	1.932815	1.481121	6.518250	5.994115
512	2048	2.475768	2.018677	11.018027	10.348198
512	4096	5.049683	4.044462	20.509706	20.092819
512	8192	7.490104	6.604655	37.552732	36.982570
512	16384	12.875270	11.570343	71.475271	70.465395
1024	1	2.098284	1.031026	2.829718	3.322475
1024	2	2.148030	1.037002	2.941216	3.445124
1024	4	2.205063	1.039145	2.984213	3.446770
1024	8	2.326964	1.042905	3.017680	3.482164
1024	16	2.350025	1.038645	3.082880	3.509426
1024	32	2.357849	1.043327	3.185493	3.562909
1024	64	2.384495	1.063795	3.311520	3.709295
1024	128	2.397448	1.089555	3.529307	3.813200
1024	256	2.394682	1.165577	4.395813	4.024131
1024	512	2.443349	1.474329	6.522480	5.993570
1024	1024	3.948729	2.061465	11.022880	10.497917
1024	2048	6.112113	4.064229	20.509120	20.146477
1024	4096	8.128917	6.614896	37.534833	37.019556
1024	8192	13.480798	11.590386	71.482375	70.432544
1024	16384	24.767629	21.567357	139.389059	138.508355
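
To make the table easier to read, the speedups can be derived directly from the latency columns. The snippet below is a small sketch that copies three rows from the table above and computes ratios; the dictionary layout and the `speedup` helper are illustrative and not part of the PR.

```python
# (item_size, batch_size) -> latency in microseconds, copied from the table above.
LATENCY_US = {
    (64, 16384): {"aot": 3.372198, "jit": 5.360094, "two_stream": 10.150578},
    (1024, 1): {"aot": 2.098284, "jit": 1.031026, "two_stream": 3.322475},
    (1024, 16384): {"aot": 24.767629, "jit": 21.567357, "two_stream": 138.508355},
}


def speedup(baseline_us: float, candidate_us: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_us / candidate_us


for (item_size, batch_size), row in LATENCY_US.items():
    vs_stream = speedup(row["two_stream"], row["jit"])
    vs_aot = speedup(row["aot"], row["jit"])
    print(
        f"item_size={item_size:5d} batch_size={batch_size:6d} "
        f"JIT vs 2-stream: {vs_stream:.2f}x, JIT vs AOT: {vs_aot:.2f}x"
    )
```

For the largest configuration (item_size 1024, batch_size 16384), the JIT kernel is roughly 6.4x faster than the 2-stream PyTorch path and about 1.15x faster than the AOT kernel. The ratio drops below 1 against the AOT kernel at item_size 64 with large batches (3.37 μs vs 5.36 μs at batch_size 16384), so the JIT kernel is not the fastest option in every row.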

End-to-end throughput gain for Llama 3.1 8B on B200 in send_one: 248.5 -> 254.5 (about 2.4%).

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-01T17:53:59Z

Summary of Changes

Hello @DarkSharpness, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to dramatically enhance the performance of KV cache operations in SGLang by integrating a new JIT-compiled CUDA kernel. The current PyTorch-based approach is identified as a bottleneck, and this change introduces highly optimized memory copy operations for KV cache storage. The improvements are particularly beneficial for large language models, and the PR provides extensive benchmarks validating the significant latency reductions achieved. Additionally, it includes minor corrections and structural improvements to the QK-Norm kernel.

Highlights

JIT KV Cache Support: Introduces a Just-In-Time (JIT) compiled kernel for storing key-value (KV) cache, significantly improving efficiency over the existing PyTorch native API.
Performance Optimizations: Implements aggressive optimizations tailored for scenarios with large 'num_kv_head * head_dim' (e.g., Llama 3.1 8B), leading to substantial latency reductions as demonstrated by benchmarks.
QK-Norm Refactoring & Fixes: Addresses minor errors in the 'qknorm' functionality and reorganizes its underlying CUDA kernel file by moving 'norm.cuh' to 'elementwise/qknorm.cuh'.
Comprehensive Benchmarking: Includes detailed performance benchmarks comparing the new SGL JIT Kernel against SGL AOT Kernel, PyTorch Compile, and PyTorch 2 Stream, showcasing the JIT kernel's superior performance across various batch and item sizes.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a JIT-compiled kernel for setting the KV cache, aiming to improve performance over the existing PyTorch-based implementation. The changes include the new CUDA kernel, its Python interface, corresponding benchmarks, and unit tests. Additionally, it refactors the qknorm kernel by renaming and moving files.

My review focuses on the new JIT kernel implementation and its usage. I've identified a potential performance improvement in the Python wrapper for the new kernel and a minor style issue in one of the benchmark files. Overall, the changes look good and the performance gains shown in the benchmarks are impressive.

python/sglang/jit_kernel/benchmark/bench_qknorm.py

python/sglang/jit_kernel/kvcache.py

DarkSharpness · 2026-01-01T18:28:13Z

/tag-and-rerun-ci

python/sglang/jit_kernel/kvcache.py

python/sglang/jit_kernel/csrc/elementwise/kvcache.cuh

merrymercy · 2026-01-02T07:11:57Z

Can it provide similar acceleration to `set_mla_kv_buffer_triton(` (sglang/python/sglang/srt/mem_cache/memory_pool.py, line 1626 in 0270426)?

python/sglang/jit_kernel/kvcache.py

DarkSharpness requested a review from BBuf as a code owner January 1, 2026 17:53

gemini-code-assist bot reviewed Jan 1, 2026

python/sglang/jit_kernel/benchmark/bench_qknorm.py (resolved)

python/sglang/jit_kernel/kvcache.py (resolved)

DarkSharpness requested review from Ying1123, hanming-lu, hnyls2002, merrymercy, xiezhq-hermann and yizhang2077 as code owners January 1, 2026 18:23

github-actions bot added the run-ci label Jan 1, 2026

merrymercy reviewed Jan 2, 2026

python/sglang/jit_kernel/kvcache.py (outdated, resolved)

python/sglang/jit_kernel/csrc/elementwise/kvcache.cuh (resolved)

DarkSharpness force-pushed the jit_set_kv branch from d558c86 to a23c55a on January 2, 2026 07:11

merrymercy reviewed Jan 2, 2026

python/sglang/jit_kernel/kvcache.py (outdated, resolved)

merrymercy approved these changes Jan 2, 2026

DarkSharpness force-pushed the jit_set_kv branch from 659e8aa to 562f58e on January 5, 2026 08:16

BBuf approved these changes Jan 8, 2026

BBuf self-assigned this Jan 8, 2026

DarkSharpness force-pushed the jit_set_kv branch from a5066bd to 9624a7e on January 8, 2026 05:58

DarkSharpness added the high priority label Jan 8, 2026

DarkSharpness force-pushed the jit_set_kv branch from 9624a7e to 5aabdab on January 8, 2026 18:48

DarkSharpness added 7 commits January 9, 2026 15:36

feat: support jit set kv cache

b479f58

feat: integrate into srt

f44b77a

fix: fix incontiguous loc in speculative decoding

35ade9f

minor: restrict to cuda for now

53b7d49

minor: rename element_dim -> row_dim

39ba411

minor: only reshape tensor when needed

ccdccc9

fix: disable JIT kernel when k_head_dim != v_head_dim

d6b01d7
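
Two of the commits above narrow when the JIT path is taken: it is restricted to CUDA, and it is disabled when the key and value head dimensions differ. A guard along these lines might look like the sketch below. The function name, its parameters, and the string-based device check are hypothetical and are not taken from the PR's code.

```python
def can_use_jit_store(device_type: str, k_head_dim: int, v_head_dim: int) -> bool:
    """Hypothetical eligibility check for a JIT KV-store path.

    Mirrors two conditions named in the commit messages:
    CUDA only, and matching K/V head dimensions.
    """
    if device_type != "cuda":
        return False  # "restrict to cuda for now"
    if k_head_dim != v_head_dim:
        return False  # "disable JIT kernel when k_head_dim != v_head_dim"
    return True


print(can_use_jit_store("cuda", 128, 128))
print(can_use_jit_store("cuda", 192, 128))
print(can_use_jit_store("cpu", 128, 128))
```

When the check fails, the caller would fall back to the existing store path; which fallback SGLang actually uses is not shown in this thread.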

DarkSharpness force-pushed the jit_set_kv branch from 5aabdab to d6b01d7 on January 9, 2026 07:36

merrymercy merged commit d112f6a into sgl-project:main Jan 11, 2026
203 of 216 checks passed

DarkSharpness deleted the jit_set_kv branch January 11, 2026 04:11

hnyls2002 added a commit that referenced this pull request Jan 11, 2026

Revert "[Feature] Support JIT set kv cache (#16273)"

314e4e8

This reverts commit d112f6a.

This was referenced Jan 11, 2026

Fix wrong kernel selection for int32/int64 indices #16912

Merged

[Model] Support IQuest-Coder-40B-Loop #16348

Merged

gongwei-130 mentioned this pull request Jan 13, 2026

[Bug] llama4 maverick model load failed. #17003

Closed

5 tasks

DarkSharpness mentioned this pull request Jan 13, 2026

[Roadmap] JIT kernel development #17035

Open

21 tasks

DarkSharpness mentioned this pull request Feb 26, 2026

[JIT Kernel] Migrate store_kv_cache to JIT kernel #19298

Closed

5 tasks

merrymercy merged 7 commits into sgl-project:main from DarkSharpness:jit_set_kv
