[JIT Kernel] Migrate store_kv_cache to JIT kernel#19298
Closed
Johnsonms wants to merge 2 commits into sgl-project:main
Conversation
Adds JIT-compiled `store_kv_cache` as the primary implementation in `sgl_kernel.memory`, with fallback to the AOT `sgl_kernel` op.
- `csrc/memory/store.cuh`: CUDA kernels + TVM FFI wrapper adapted from `sgl-kernel/csrc/memory/store.cu`; dispatches on int32/int64 index dtype and on 256/128-byte-aligned head dim
- `jit_kernel/store.py`: Python JIT loader exposing `store_kv_cache`
- `tests/test_store_kv_cache.py`: correctness tests across dtypes, index dtypes, batch sizes, and head dims (120 cases)
- `sgl_kernel/memory.py`: try JIT first, fall back to the sgl_kernel AOT op
- `benchmark/bench_store_kv_cache.py`: latency benchmark comparing JIT vs AOT `store_kv_cache` across item sizes and batch sizes
- `csrc/memory/store.cuh`: apply clang-format to kernel launch call sites
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
Thanks for the PR. Unfortunately, we already have this in https://github.com/sgl-project/sglang/blob/main/python/sglang/jit_kernel/csrc/elementwise/kvcache.cuh
Collaborator
Closed due to duplicate of #16273. Feel free to reopen if this PR is mistakenly closed.
Motivation
#17865
`store_kv_cache` is currently implemented as an AOT (ahead-of-time) compiled kernel in `sgl_kernel`. This PR migrates it to the JIT kernel system, consistent with the ongoing effort to slim down `sgl_kernel` and move kernels to JIT compilation. The JIT approach compiles for the exact target architecture at runtime, reducing package size and improving maintainability.
Modifications
Adds JIT-compiled `store_kv_cache` as the primary implementation in `sgl_kernel.memory`, with fallback to the AOT `sgl_kernel` op.
- `csrc/memory/store.cuh`: CUDA kernels + TVM FFI wrapper adapted from `sgl-kernel/csrc/memory/store.cu`; dispatches on int32/int64 index dtype and on 256/128-byte-aligned head dim
- `jit_kernel/store.py`: Python JIT loader exposing `store_kv_cache`
- `tests/test_store_kv_cache.py`: correctness tests across dtypes, index dtypes, batch sizes, and head dims (120 cases)
- `benchmark/bench_store_kv_cache.py`: latency benchmark comparing JIT vs AOT across item sizes and batch sizes
- `sgl_kernel/memory.py`: try JIT first, fall back to AOT `torch.ops.sgl_kernel.store_kv_cache`

Accuracy Tests
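The operation under test amounts to a row scatter: each new K/V row is written into the cache buffer at the slot index given by `loc`, which is what the CUDA kernel implements with vectorized copies. A pure-PyTorch reference of those semantics (a sketch for illustration, not the PR's test code; the function name here is hypothetical) looks like:

```python
import torch


def store_kv_cache_ref(k_buffer: torch.Tensor, v_buffer: torch.Tensor,
                       k: torch.Tensor, v: torch.Tensor,
                       loc: torch.Tensor) -> None:
    # Scatter the new K/V rows into the cache slots named by `loc`.
    # The JIT kernel performs the same copy, choosing 256- or 128-byte
    # vectorized loads/stores based on head-dim alignment.
    k_buffer[loc] = k
    v_buffer[loc] = v


# Tiny example: an 8-slot cache with head dim 4, storing 3 new rows.
k_buf = torch.zeros(8, 4)
v_buf = torch.zeros(8, 4)
k_new = torch.ones(3, 4)
v_new = torch.full((3, 4), 2.0)
loc = torch.tensor([1, 5, 7], dtype=torch.int64)
store_kv_cache_ref(k_buf, v_buf, k_new, v_new, loc)
```

The correctness tests can then compare the JIT and AOT outputs against this kind of reference across the dtype/shape grid.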
Verified in `python/sglang/jit_kernel/tests/test_store_kv_cache.py`, 120 cases, all passing:

```bash
python -m pytest python/sglang/jit_kernel/tests/test_store_kv_cache.py -v
```

... 120 passed
Covers: float16 / bfloat16 / float32 × int32 / int64 indices × batch sizes [1, 4, 16, 64, 128] ×
head dims [64, 128, 256, 512].
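The 120-case count follows directly from the Cartesian product of those parameters (3 dtypes × 2 index dtypes × 5 batch sizes × 4 head dims). A quick sketch of such a grid, with the same values as above:

```python
import itertools

# Parameter grid matching the coverage described above:
# 3 dtypes x 2 index dtypes x 5 batch sizes x 4 head dims = 120 cases.
DTYPES = ["float16", "bfloat16", "float32"]
INDEX_DTYPES = ["int32", "int64"]
BATCH_SIZES = [1, 4, 16, 64, 128]
HEAD_DIMS = [64, 128, 256, 512]

CASES = list(itertools.product(DTYPES, INDEX_DTYPES, BATCH_SIZES, HEAD_DIMS))
print(len(CASES))  # 120
```

In a pytest file this grid would typically be expressed with `@pytest.mark.parametrize` over the same four lists.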
Benchmarking and Profiling
Benchmarked on H200, bfloat16, 8 layers, comparing JIT vs AOT (`sgl_kernel.set_kv_buffer_kernel`):

```bash
python python/sglang/jit_kernel/benchmark/bench_store_kv_cache.py
```
JIT and AOT show comparable latency (direct port of the same algorithm), confirming no regression.
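For reference, a minimal latency-measurement harness of the kind such a benchmark uses might look like the sketch below (the function name and parameters are illustrative; the real benchmark on GPU would time with CUDA events and synchronize rather than wall-clock timing):

```python
import time


def bench(fn, warmup: int = 10, iters: int = 100) -> float:
    """Return the mean latency of `fn` in microseconds.

    CPU wall-clock sketch: a GPU benchmark would instead record
    torch.cuda.Event pairs and call torch.cuda.synchronize().
    """
    for _ in range(warmup):   # warm caches / trigger JIT compilation
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e6


# Example with a trivial stand-in workload.
latency_us = bench(lambda: sum(range(1000)))
```

Running both the JIT and AOT kernels through the same harness across item sizes and batch sizes is what supports the "comparable latency" conclusion above.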
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`