Add fused FP8 KV cache write kernel for TRTLLM MHA backend by harvenstar · Pull Request #14093 · sgl-project/sglang

harvenstar · 2025-11-28T08:35:59Z

Summary

This PR introduces a fused FP8 KV cache write kernel for the TRTLLM MHA backend.
It combines FP8 quantization and paged KV cache writes into a single Triton kernel (plus a naive fallback) to reduce kernel launch overhead and avoid intermediate FP8 tensors.

Motivation

The current FP8 KV cache path launches multiple small elementwise kernels per step:

FP8 quantization of K
FP8 quantization of V
Writing K to the paged KV cache
Writing V to the paged KV cache

These kernels all operate on the same data and are memory bound.
The extra launches and intermediate tensors add overhead without bringing additional parallelism.

This PR fuses FP8 quantization and KV cache writes into one kernel for the TRTLLM backend.
RoPE and attention logic remain unchanged, but the extra FP8 elementwise kernels in the KV write path are removed.

Changes

New fused FP8 KV kernel

File added

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

Triton kernel: _fused_fp8_set_kv_buffer_kernel

Takes post RoPE K and V in BF16 or FP16.
Applies FP8 quantization using provided scalar scales.
Writes directly into the paged FP8 KV cache layout.
Respects the existing uint8 backed FP8 storage convention.

Python wrapper: fused_fp8_set_kv_buffer

Supports both:
- 3D cache layout: [total_slots, num_kv_heads, head_dim]
- 4D cache layout: [num_pages, page_size, num_kv_heads, head_dim]
Handles both 2D and 3D K, V inputs via runtime shape checks.
Provides a naive fallback that exactly reproduces MHATokenToKVPool.set_kv_buffer semantics.
Naive path is used as the golden implementation in unit tests.

Integration into TRTLLM attention backend

Wires fused_fp8_set_kv_buffer into:
- TRTLLMHAAttnBackend.forward_decode
- TRTLLMHAAttnBackend.forward_extend
  when kv_dtype is FP8.
Adds a feature flag style switch so non FP8 configurations are unaffected.
Keeps the original code path as a correctness reference.
Removes or downgrades all development-only debug logging so runtime logs remain minimal.

Accuracy and Testing

The fused path is designed to be numerically equivalent to the existing FP8 KV cache behavior.

Unit tests

Naive fused implementation vs MHATokenToKVPool.set_kv_buffer
- Bit exact match at the written KV cache locations for both K and V.
Triton implementation vs naive implementation
- Matches the naive path when using the same scalar FP8 scales.
- Uses the same quantization formula as the reference path.

End to end evaluation (GSM8K)

Setup:
- Qwen3 MoE, FP8 KV cache, TRTLLM backend.
- bench_sglang.py, 8 shot, 1319 questions.
Results:
- Baseline accuracy: ~0.956
- Fused FP8 KV cache accuracy: ~0.958
- Invalid rate: 0.0

No regression on GSM8K accuracy was observed.

Benchmarking and Profiling

Nsight profiling shows that the four small FP8 KV elementwise kernels in the TRTLLM path are replaced by a single fused kernel in forward_decode and forward_extend when FP8 KV cache is enabled.
KV write latency is reduced and the number of kernel launches per decode step is lower.
Overall GSM8K benchmark latency remains in the same range, with KV cache writes no longer dominating the FP8 KV path.

Exact speedups depend on GPU, model, batch size, and sequence length.
The main benefit of this PR is to remove unnecessary elementwise launches in the TRTLLM FP8 KV write path while keeping behavior identical.

This PR introduces a fused kernel that combines FP8 quantization and KV cache writes to reduce launch overhead and improve memory efficiency for the TRTLLM backend. Key changes: - Add trtllm_fp8_kv_kernel.py with Triton-based fused kernel - Integrate fused path into TRTLLMHAAttnBackend forward_decode/forward_extend - Add runtime detection in qwen3_moe.py to enable fusion for FP8 KV cache The kernel writes post-RoPE K/V directly into paged FP8 cache layout, eliminating intermediate tensors. Supports both 3D and 4D cache layouts with runtime strides. Validation: Unit tests show bit-exact match with original implementation. End-to-end GSM8K benchmark maintains accuracy (~0.956-0.958) with reduced KV write latency.

gemini-code-assist · 2025-11-28T08:36:19Z

Summary of Changes

Hello @harvenstar, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the FP8 KV cache write process within the TRTLLM MHA backend. By fusing the FP8 quantization and paged KV cache write operations into a single Triton kernel, it addresses performance bottlenecks caused by multiple small kernel launches and the creation of intermediate tensors. This change aims to enhance the efficiency of the KV cache write path, particularly for FP8 configurations, while maintaining numerical accuracy and overall system performance.

Highlights

Fused FP8 KV Cache Write: Introduces a new Triton kernel that combines FP8 quantization and paged KV cache writes for the TRTLLM MHA backend.
Performance Optimization: Reduces kernel launch overhead and eliminates intermediate FP8 tensors, replacing multiple small elementwise kernels with a single fused operation.
New Kernel Implementation: Adds trtllm_fp8_kv_kernel.py containing the Triton kernel _fused_fp8_set_kv_buffer_kernel and its Python wrapper fused_fp8_set_kv_buffer.
Backend Integration: Integrates the fused kernel into TRTLLMHAAttnBackend.forward_decode and forward_extend for FP8 KV cache configurations, controlled by a feature flag.
Correctness and Accuracy: Includes a naive fallback implementation for unit testing and demonstrates no regression in accuracy on GSM8K benchmarks.
Benchmarking: Nsight profiling confirms reduced KV write latency and fewer kernel launches, improving efficiency in the FP8 KV write path.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a fused FP8 KV cache write kernel for the TRTLLM MHA backend, which is a great performance optimization. The implementation is well-structured, including a Triton kernel for performance and a naive Python fallback for correctness testing. My review focuses on improving code maintainability and efficiency in a few areas. I've suggested refactoring duplicated code in the Triton kernel and the Python backend, vectorizing a loop in the naive implementation, and improving a type check to be more robust. Overall, this is a solid contribution.

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

python/sglang/srt/layers/attention/trtllm_mha_backend.py

python/sglang/srt/models/qwen3_moe.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

- Remove unused FP8_E4M3_MAX constant - Vectorize 4D cache write in naive fallback implementation - Extract duplicated fused path check logic into _should_use_fused_fp8_path method

Extract duplicated K and V tensor processing logic into _process_kv_tensor to improve code maintainability and reduce duplication. The helper function is JIT-compiled and will be inlined by Triton for optimal performance. Addresses Gemini Code Assist feedback about code duplication.

Qiaolin-Yu · 2025-11-30T22:39:49Z

/tag-and-rerun-ci

Qiaolin-Yu · 2025-12-01T23:55:35Z

python/sglang/srt/models/qwen3_moe.py

+        attn_extra_kwargs = {}
+
+        # Enable TRTLLM FP8 fusion if conditions are met
+        if _should_enable_trtllm_fp8_fuse(forward_batch):


Why do we need to check it here?

Good catch.

Qiaolin-Yu · 2025-12-01T23:58:39Z

python/sglang/srt/layers/attention/trtllm_mha_backend.py

        """Get the fill value for sequence lengths in CUDA graph."""
        return 1

+    def _should_use_fused_fp8_path(


So it's only enabled for Qwen? Is it possible to make it more general so that other models can use it?

The check should be in the backend, not model-specific code. I'll move the fusion detection
logic entirely into TRTLLMHAAttnBackend._should_use_fused_fp8_path() and remove all the qwen3_moe
changes. This way any model using TRTLLM backend with FP8 KV cache gets the optimization automatically. Does the change sound good to you?

Done.✅ Fusion logic now lives in backend, and work for all models.

- Delete model-specific _should_enable_trtllm_fp8_fuse() function - Simplify TRTLLMHAAttnBackend._should_use_fused_fp8_path() to check FP8 dtype directly - Remove attn_extra_kwargs mechanism from Qwen3MoeAttention - This makes FP8 KV cache fusion available to all models using TRTLLM backend

ispobock

share profile figure comparison to see the improvement after fusion
share end to end throughput improvement on Qwen3-235B

qq: Why trtllm_mla don't need this fusion?

ispobock · 2025-12-03T14:35:16Z

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

+    BLOCK_DIM: tl.constexpr,
+):
+    """Process a single K or V tensor: load, quantize to FP8, and write to cache."""
+    for head_idx in range(0, num_kv_heads, BLOCK_HEAD):


Why don't parallelize heads? (i.e. add head to grid)

Done. Moved head iteration to grid dimension via head_block_id = tl.program_id(1).

ispobock · 2025-12-03T14:36:15Z

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

+    v_inv_scale = 1.0 / v_scale if use_provided_scale else 1.0
+
+    # Process K tensor
+    _process_kv_tensor(


Can we also parallelize k and v?

Done. Grid is now (tokens, head_blocks, 2) where kv_idx = tl.program_id(2) selects K or V.

ispobock · 2025-12-03T14:39:06Z

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

+        # Grid: one program per token
+        grid = (num_tokens,)
+
+        if not hasattr(fused_fp8_set_kv_buffer, "_triton_logged"):


It seems not needed?

Removed in latest commit.

ispobock · 2025-12-03T14:39:12Z

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

+        )
+    else:
+        # Fallback to naive implementation
+        if not hasattr(fused_fp8_set_kv_buffer, "_naive_logged"):


It seems not needed?

Removed in latest commit.

ispobock · 2025-12-03T14:40:01Z

python/sglang/srt/layers/attention/trtllm_fp8_kv_kernel.py

+    )
+
+
+def fused_fp8_set_kv_buffer(


Could you add unit test for this kernel?

Added test/srt/test_trtllm_fp8_kv_kernel.py with serveral test cases covering
different input/cache dimensions, scaling options, and edge cases. All
passing.

harvenstar · 2025-12-03T15:30:26Z

share profile figure comparison to see the improvement after fusion

share end to end throughput improvement on Qwen3-235B

qq: Why trtllm_mla don't need this fusion?

Thank you for reviewing. Here is the analysis and comparison of the performance metrics,.

1. Throughput Improvement Summary

The fusion optimization led to a noticeable improvement in overall token throughput (tok/s) and a reduction in inter-token latency (ITL) on the Qwen3-235B model with FP8 KV cache enabled.

Metric	Before Fusion	After Fusion	Improvement (%)
Total token throughput (tok/s)	1635.66	1693.52	+3.54%
Output token throughput (tok/s)	818.01	846.94	+3.54%
Mean TPOT (Time per Output Token) (ms)	19.07	18.42	-3.41% (Faster)
Mean ITL (Inter-Token Latency) (ms)	19.08	18.43	-3.41% (Faster)
Request throughput (req/s)	0.81	0.84	+3.70%
Mean E2E Latency (ms)	19737.67	19064.12	-3.41% (Faster)

2. Full Benchmark Results and Profile Figure (For Reference)

To maintain the exact formatting and alignment of the raw data, the full results are enclosed in separate code blocks.

Before Fusion

After Fusion

Before Fusion

============ Serving Benchmark Result ============
Backend:                     sglang-oai
Traffic request rate:        inf
Max request concurrency:     16
Successful requests:         80
Benchmark duration (s):      99.13
Total input tokens:          81050
Total input text tokens:     81050
Total input vision tokens:   0
Total generated tokens:      81085
Total generated tokens (retokenized): 81047
Request throughput (req/s):  0.81
Input token throughput (tok/s): 817.65
Output token throughput (tok/s): 818.01
Peak output token throughput (tok/s): 992.00
Peak concurrent requests:    23
Total token throughput (tok/s): 1635.66
Concurrency:                 15.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):       19737.67
Median E2E Latency (ms):     19797.55
---------------Time to First Token----------------
Mean TTFT (ms):              430.65
Median TTFT (ms):            303.05
P99 TTFT (ms):               842.41
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):              19.07
Median TPOT (ms):            19.15
P99 TPOT (ms):               20.51
---------------Inter-Token Latency----------------
Mean ITL (ms):               19.08
Median ITL (ms):             17.27
P95 ITL (ms):                17.68
P99 ITL (ms):                34.68
Max ITL (ms):                473.31

After Fusion

============ Serving Benchmark Result ============
Backend:                                 sglang-oai
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     80        
Benchmark duration (s):                  95.74     
Total input tokens:                      81050     
Total input text tokens:                 81050     
Total input vision tokens:               0         
Total generated tokens:                  81085     
Total generated tokens (retokenized):    81047     
Request throughput (req/s):              0.84      
Input token throughput (tok/s):          846.58    
Output token throughput (tok/s):         846.94    
Peak output token throughput (tok/s):    1040.00   
Peak concurrent requests:                23        
Total token throughput (tok/s):          1693.52   
Concurrency:                             15.93     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   19064.12  
Median E2E Latency (ms):                 19127.56  
---------------Time to First Token----------------
Mean TTFT (ms):                          415.68    
Median TTFT (ms):                        305.58    
P99 TTFT (ms):                           764.02    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          18.42     
Median TPOT (ms):                        18.50     
P99 TPOT (ms):                           19.88     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           18.43     
Median ITL (ms):                         16.62     
P95 ITL (ms):                            16.98     
P99 ITL (ms):                            33.48     
Max ITL (ms):                            459.80    
==================================================

- Parallelize heads dimension: add head_block_id to grid - Parallelize K/V processing: grid=(tokens, head_blocks, 2) - Remove unnecessary logging code - Add comprehensive unit tests (10 test cases, all passing) - Store FP8 directly without uint8 conversion Addresses ispobock review comments

ispobock · 2025-12-04T14:05:43Z

/tag-and-rerun-ci

harvenstar · 2025-12-04T20:20:36Z

/tag-and-rerun-ci

…14093)" This reverts commit 38daa29.

Fix B200 CI failure that caused sgl-project#14093 to be reverted in sgl-project#14550. Original issue discovered in CI run: https://github.com/sgl-project/sglang/actions/runs/19985523381/job/57335123826?pr=14493 Root cause: - layer.k_scale/v_scale are torch.nn.Parameter (0-D tensors) - Triton interprets tensor arguments as pointers - Expression '1.0 / k_scale' in kernel caused IncompatibleTypeError Solution: - Convert tensor scales to Python float in wrapper before kernel launch - Handles None, Tensor, and scalar cases explicitly - Preserves per-tensor scaling semantics Testing: - Added regression test test_fp8_kv_kernel_accepts_tensor_scales - Verified on local environment with ATTN_BACKEND=trtllm_mha - All 11 kernel tests pass, including new regression test - test_llama31_fp4.py::TestLlama31FP4B200::test_gsm8k passes

…gl-project#14093)" This reverts commit 38daa29.

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

* [model-gateway] extract conversation out of oai router (sgl-project#14440) Co-authored-by: key4ng <rukeyang@gmail.com> * [DeepseekV3.2][NSA][Indexer] Fix PAGED top-k transform for NSA indexer chunked execution on H200 (sgl-project#14325) * [model-gateway] move oai header util to router header util (sgl-project#14441) Co-authored-by: key4ng <rukeyang@gmail.com> * [FIX] trtllm-moe-fp4-renorm for Qwen series models (sgl-project#14350) * add doc for quantized kv cache (sgl-project#14348) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> * fix: Correct environment variable syntax in docker-compose configuration (sgl-project#8287) Signed-off-by: Kay Yan <kay.yan@daocloud.io> * [model-gateway] move all responses api event from oai to proto (sgl-project#14446) Co-authored-by: key4ng <rukeyang@gmail.com> * [model-gateway] add mistral 3 image processor (sgl-project#14445) Co-authored-by: Chang Su <chang.s.su@oracle.com> * [model-gateway] grpc to leverage event type (sgl-project#14450) Co-authored-by: Chang Su <chang.s.su@oracle.com> * ministral3 (sgl-project#14251) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Yueming Yuan <yy28@illinois.edu> * [Bug] fix not desired disable fused share experts caused by rocm logic (sgl-project#14432) * Rename secrets.WHL_TOKEN -> secrets.GH_PAT_FOR_WHL_RELEASE (sgl-project#14421) Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> * [diffusion] improve: further optimize model load (sgl-project#13836) * Add CI permissions for user 'yushengsu-thu' (sgl-project#14468) * [ez] Fix typing (sgl-project#14473) * Add AMD stage support to /rerun-stage command and fix related bugs (sgl-project#14463) * Add YAMY1234 to CI Permission (sgl-project#14475) * clean up gemlite usage (sgl-project#14444) * [diffusion] chore: further improve model searching logic (sgl-project#14484) * [diffusion] fix: fix bug about pin memory when offloading (sgl-project#14472) * [diffusion] cli: add argument --adjust-frames and --override-protected-fields (sgl-project#13996) Co-authored-by: dev <devnull@example.com> Co-authored-by: Mick <mickjagger19@icloud.com> * dockerfile: add runtime stage + ubuntu 24.04 (sgl-project#13861) * [diffusion] fix: fix CLIP text encoder attention mask not used (sgl-project#14364) Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com> * Enable RadixCache for Mamba2 models (sgl-project#13584) * [diffusion] fix: Fix profiler trace missing Python stack in diffusion pipeline (sgl-project#14499) * support GLM-V vision model dp (sgl-project#14097) * [misc] add model arch and type to server info and use it for harmony (sgl-project#14456) * Add Mistral Large 3 Eagle Support (sgl-project#14466) Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> * Add Mistral Large 3 to nightly CI tests (sgl-project#14459) * [diffusion] chore: set allowing overriding protected fields of sampling params as default behavior (sgl-project#14471) * [model-gateway] move conversation to first class routing (sgl-project#14506) Co-authored-by: key4ng <rukeyang@gmail.com> * [Spec] Mamba2 support in target models (sgl-project#13434) * [diffusion] feat: support cache-dit integration (sgl-project#14234) Co-authored-by: shuxiguo <shuxiguo@meituan.com> Co-authored-by: DefTruth <qiustudent_r@163.com> Co-authored-by: Mick <mickjagger19@icloud.com> * Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-project#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com> * [model-gateway] Add WASM support for middleware (sgl-project#12471) Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> * [model-gateway] reorganized conversation handler (sgl-project#14507) Co-authored-by: key4ng <rukeyang@gmail.com> * tiny remove deprecated endpoint call (sgl-project#13607) * [model-gateway] fix server info comment (sgl-project#14508) * Add Mistral Large 3 basic test to PR CI (sgl-project#14460) * Fix removing worker will make it healthy forever in prometheus metrics (sgl-project#14420) * [model-gateway] Make Tokenizer Builder Aware of Env Vars Like HF_ENDPOINT (sgl-project#14405) * [model-gateway] change sgl-router to sgl-model-gateway (sgl-project#14312) * [model-gateway] fix left over sgl-router names to sgl-model-gateway (sgl-project#14512) * [model-gateway] fix logs in smg workflow (sgl-project#14513) * [model-gateway] fix left over sgl-router names in wasm (sgl-project#14514) * [model-gateway] fix code owner for wasm (sgl-project#14516) * chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14427) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> * Tiny use trtllm_mha as default when possible (sgl-project#14291) * [Docs] Add /rerun-stage command to contribution guide (sgl-project#14521) * Fix safetensors validation to catch corruption after download (sgl-project#14465) * [CODEOWNER] update codeowner for qwen3-next related (sgl-project#14522) * fix: fix rmsnorm -> layernorm in qwen3 omni (sgl-project#11791) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * [diffusion] chore: temporarily upgrade diffusers to make Z-image compatible with Cache-DiT (sgl-project#14530) * [bug] fix notebook to include new keys from model_info (sgl-project#14528) * Revise DP Multi-Modal Encoder Document (sgl-project#14290) * [CPU] add mamba fla kernels for Qwen3-next (sgl-project#12324) * Revert "tiny remove deprecated endpoint call" (sgl-project#14533) * support mtp with deepseek r1 nvfp4 model (sgl-project#13115) Co-authored-by: Trevor Morris <tmorris@nvidia.com> * [diffusion] refactor: simplify sampling params' override logic (sgl-project#14539) * [diffusion] perf: add QKV fusion optimization for Flux models (sgl-project#14505) Co-authored-by: Mick <mickjagger19@icloud.com> * [model-gateway][tracing]: implement request tracing using OpenTelemetry with trace context propagation (HTTP) (sgl-project#13897) * [diffusion] lora: fix LoRA dtype handling and weight attribute access for z-image model (sgl-project#14543) Co-authored-by: niehen6174 <nihen6174@gmail.com> * fix "GrammarMatcher has terminated after accepting the stop token, but is trying to find the next token mask" when both reasoning and spec are enabled (sgl-project#14464) * [1/n] Fix hanging during DeepGemm Warmup (sgl-project#14493) * [Bug fix] Add /model_info endpoint to mini_lb (sgl-project#14535) * [Qwen3-next] remove heuristics and add radix cache kl test (sgl-project#14520) * [Misc]Register and refactor some environs for dpsk-fp4 and DeepEp (sgl-project#14538) * chore: bump sgl-kernel version to 0.3.18.post3 (sgl-project#14518) * Update CI_PERMISSIONS.json (sgl-project#14552) * Update DeepSeek V3 docs to use B200 (sgl-project#14447) * [Doc] Add short explanation on page size (sgl-project#14557) * [docs] Add missing word in argument description (sgl-project#14205) * support piecewise cuda graph for Olmo models (sgl-project#14476) * Enhance prefill PP node robustness (sgl-project#14494) * DOC update nemo-skills in docs (sgl-project#14555) Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> * remove unecessary dual stream token threshold from the rest of models (qwen moe, kimi linear, etc.) (sgl-project#14337) * feat(ci): add framework target to release-docker workflows (sgl-project#14559) * Fix attention backend logic for Qwen3-Next on SM100 (sgl-project#14560) * [FLA] Add explicit kernel arguments to kda.py for Kimi Linear support (sgl-project#14561) * Add CUDA kernel size analysis tool for sgl-kernel optimization (sgl-project#14544) * [DLLM] feat: Add threshold based parallel decoding support (sgl-project#14412) Co-authored-by: Jinwei Yao <jinweiy@illinois.edu> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com> * Add unit-test-backend-8-gpu-b200 to rerun-stage command (sgl-project#14569) * [apply][2/2] Fused qk_norm_rope for Qwen3-MoE (sgl-project#13998) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * Add Expert Parallelism (EP) support for kimi-k2-thinking (sgl-project#13725) * Tiny remove wrong import from `python.sglang` (sgl-project#14577) * Add small model test for spec v2 + dp + trtllm_mla (sgl-project#14576) * [diffusion] cli: profiling utilities support (sgl-project#14185) Co-authored-by: jianyingzhu <53300651@qq.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: Mick <mickjagger19@icloud.com> * [NPU]LoRA: Adding Torch Native backend (sgl-project#14132) * [BugFix] fix prefixcache performance and accuracy on ascend (sgl-project#13573) * Fix FP8 KV Triton type issue and add regression test (sgl-project#14553) * Rename TensorRT Model Optimizer to Model Optimizer (sgl-project#14455) Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> * [CI] Tiny speed up VLM CI (sgl-project#14517) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * [Minor] Temporarily skipping deepep large mtp test (sgl-project#14586) * [model-gateway] extra accumulator and tool handler in oai router (sgl-project#14587) * [model-gateway] Fixed WASM Security Vulnerability - Execution Timeout (sgl-project#14588) * [model-gateway] reorganize metrics, logging, and otel to its own module (sgl-project#14590) * Refactor tuning block wise kernel and opt Qwen/Qwen3-VL-32B-Instruct-FP8 (sgl-project#14141) * [CI]Unblock and split spec v2+dp test (sgl-project#14551) * [Tool Call] Fix DeepSeekV32Detector skipping functions with no params in streaming mode (sgl-project#14573) * [feat] use cachebuffer to store mm feature to speedup hash (sgl-project#14386) * [CI] Fix unit-test-backend-8-gpu-b200 running on every /rerun-stage (sgl-project#14591) * [model-gateway] fix WASM memory limit per module (sgl-project#14600) * Tiny fix missing policy decision recording (sgl-project#14605) * Super tiny remove unneeded policy flag (sgl-project#14608) * [model-gateway] refactor otel to be more efficient (sgl-project#14604) * Super tiny remove unused select_worker_pair (sgl-project#14609) * [model-gateway] fix WASM unbounded request/response body read vuln (sgl-project#14612) * [2/2] Add rope kernel in sgl-kernel (sgl-project#14452) * [DLLM] Add initial cuda graph support (sgl-project#14203) * Super tiny fix unused code in router (sgl-project#14618) * [Glm46v] Bug fix for accuracy drop and unable to launch server (sgl-project#14585) Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: zRzRzRzRzRzRzR <2448370773@qq.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> * Fix amd rope definition (sgl-project#14556) * modify the sgl-kernel to be compatible with transformers 5.x. (sgl-project#14625) * [Reasoning + Structured Output] make reasoning compatible with structured output (sgl-project#12551) Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> * [diffusion] feat: add support for LoRA layers in transformer_2 within LoRAPipeline (sgl-project#14606) * chore: bump sgl-kernel version to 0.3.19 (sgl-project#14632) * [cpu] Implement all gather/reduce for arm64 cpu (sgl-project#12527) * [diffusion] chore: further refine output resolution adjustment logic (sgl-project#14558) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Fix dp-aware incompatible with service-discovery (sgl-project#14629) * update transformers package version to 5.0.0rc0 (sgl-project#14356) * chore: bump sgl-kernel version to 0.3.19 (sgl-project#14649) * chore: bump SGLang version to 0.5.6.post1 (sgl-project#14651) * [AMD] change fused rms quant interface for aiter upgrade (sgl-project#14497) * [model-gateway] reducing cpu overhead in various of places (sgl-project#14658) * [model-gateway] reduce cpu overhead in grpc router (sgl-project#14663) * [model-gateway] fix WASM arbitrary file read security vol (sgl-project#14664) * vlm: Use fa3 as the default backend for qwen3 vl (sgl-project#14634) * [model-gateway] Optimize memory usage in HTTP router (sgl-project#14667) * fix: use .get() when accessing strict mem-check env variable (sgl-project#14657) * improve default glm mtp setting (sgl-project#14457) Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com> * Fix cache-aware router should pick min load instead of min tenant size (sgl-project#14650) * Bump up diffusers to latest official release version (sgl-project#14670) * [model-gateway] add OTEL integration to grpc router (sgl-project#14671) * [CI] Increase max-parallel to 15 for high priority PRs (sgl-project#14675) * [HiCache] fix condition check when use decode offload (sgl-project#14489) * [RadixTree] Optimize the Time Complexity of Node Retrieval Operation from O(n*m) to O(n) (sgl-project#13334) Signed-off-by: CLFutureX <chenyongqyl@163.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> * Tiny support printing requests in bench_serving for observability (sgl-project#14652) * Aiter fp8 kv cache (sgl-project#13147) * [SMG]feat: implement TokenGuardBody for managing token return (sgl-project#14653) * [NPU] chore: bump basic software version to 8.3.rc2 (sgl-project#14614) * [CI] Unblock gb200 cutedsl test (sgl-project#14469) * Add ffmpeg into sglang docker - required by transformers multimodal V… (sgl-project#14679) * [Bugfix] Fix KeyError for Mistral-Large-3 rope_scaling config (sgl-project#14627) * Tiny support sgl-router http response status code metrics (sgl-project#14689) * [CI] Migrate Eagle 1-GPU tests to test/registered/ (sgl-project#14529) * Revert "[Bug] fix not desired disable fused share experts caused by r… (sgl-project#14676) * Add per-request decode tp size (sgl-project#14678) Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> * [ci][smg] fix docker release ci and add it to pr test (sgl-project#14683) * Tiny extract select_worker_min_load (sgl-project#14648) * Fix dp-aware incompatible with completions and chat completions APIs (sgl-project#14647) * [CI] Fix Llama 3.1 8B FP4 CI (sgl-project#14699) * fix: make override DeepseekV2Model work (sgl-project#14707) * chore: add code owners for deepseek_v2.py (sgl-project#14714) * [CI] Move mistral large 3 basic to nightly (sgl-project#14622) * fix the deepep 8 gpu unit test (sgl-project#14601) * Add fuse_marlin_moe test to ci and add new ep test (sgl-project#14686) * [Bugfix] Fix environ error in scheduler_runtime_checker_mixin.py (sgl-project#14461) Signed-off-by: Kun(llfl) <i@imux.top> * [Feat] Add received_time in serving_base (sgl-project#13432) Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com> * fix: prevent HugginqFace access when SGLANG_USE_MODELSCOPE is enabled (sgl-project#12039) * [Test] Skip STANDALONE speculative decoding tests for different hidden sizes (sgl-project#14733) * [diffusion] feat: support comparing batch perf (sgl-project#14738) Co-authored-by: shuxiguo <shuxiguo@meituan.com> Co-authored-by: Mick <mickjagger19@icloud.com> * Revert "[Feat] Add received_time in serving_base" (sgl-project#14743) * [Model] Add PaddleOCR-VL Model Support (sgl-project#12953) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * fix rope parameter initialization error caused by transformers v5.0 update (sgl-project#14745) * [model-gateway] optimize core modules (sgl-project#14751) * [SMG] perf: optimize tokenizer for reduced CPU and memory overhead (sgl-project#14752) * Add FP8 Blockwise GEMM Backend Flag `--fp8-gemm-backend` (sgl-project#14379) * fix: checking if tokenizer is in cache before downloading from HF (sgl-project#14698) * fix: making rate limit a warning instead of error (sgl-project#14753) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * move multi-item scoring functions in tokenizer manager into a separate file (sgl-project#14740) * Improve CI by trying a warmup before unit tests (sgl-project#14669) * [Perf] Optimize radix tree for cache-aware load balancin (sgl-project#14758) * [Feature] Add LoRA support for embedding layers (sgl-project#14177) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Beichen-Ma <bm685@cornell.edu> * [model-gateway] release gateway 0.2.4 (sgl-project#14763) * [ci]: Enable the new hf API (sgl-project#14687) * Re-add the API serving timing metrics. (sgl-project#14744) Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com> Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com> * fix: adding rate limit warning at verify token permission stage (sgl-project#14756) * Disable 8-gpu-b200 runner in PR tests (sgl-project#14768) * [fix] Fix issues for in-flight weight updates (sgl-project#14064) Co-authored-by: 赵晨阳 <zhaochen20@outlook.com> * [Auto Sync] Update data_parallel_controller.py, detokenizer... (20251209) (sgl-project#14759) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * fix: race condition between validation and download locks (sgl-project#14761) * Fix VLM accuracy thresholds for nightly tests (sgl-project#14777) * fix server args bug (sgl-project#14725) * handling incomplete rope_scaling config ci after transformers upgrade (sgl-project#14784) * fix b200 ci (sgl-project#14786) * [RL] support weight reload for low-bit rollout (sgl-project#9650) Co-authored-by: Hecate0821 <hec4te0821@gmail.com> Co-authored-by: eternally-z <zzywzj@gmail.com> Co-authored-by: Wilboludriver <wilbolu@outlook.com> Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> * fix: add missing logic for SGLANG_USE_MODELSCOPE variable (sgl-project#14794) * fix b200 fa4 ci (sgl-project#14788) * [diffusion] profile: early exit when enough steps are captured to reduce the size of the trace file (sgl-project#14803) * [GLM-4.6V] Support Pipeline Parallelism for GLM-4.6V & GLM-4.1V (sgl-project#14720) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> * [diffusion] CI: Add LoRA support to diffusion server configuration and test cases (sgl-project#14697) * Revert "fix: checking if tokenizer is in cache before downloading from HF" (sgl-project#14808) * [diffusion] performance: refactor diffusion fuse qkv and apply to qwen-image (sgl-project#14793) * [SMG-GO] implement a Go SGLang Model Gateway - OpenAI Compatible API Server (sgl-project#14770) * [model-gateway] Dynamically Populate Tool Call Parser Choices (sgl-project#14807) * Support HTTP response status code prometheus metrics (sgl-project#14710) * Fix router keep nonzero metrics after worker is deleted (sgl-project#14819) * Tiny fix incorrect worker removal command (sgl-project#14822) * [NPU] bug fix for mtp and w4a8 (sgl-project#14806) * [CI] fix UT success check in `test_eagle_infer_beta_dp_attention.py` (sgl-project#14831) * Fix CI registry scan to only check test/registered directory (sgl-project#14812) * [model-gateway] add anthropic message api spec (sgl-project#14834) * [diffusion] doc: fix tiny typo in multimodal_gen/README.md (sgl-project#14830) * [model-gateway] support customizing Prometheus duration buckets (sgl-project#14716) * [model-gateway] support engine response http status statistics in router (sgl-project#14712) * [CI] Reduce stage-b auto-partition from 4 to 2 (sgl-project#14769) Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> * Apply back moe_sum_reduce for fused_marlin_moe (sgl-project#14829) * [diffusion] parallel: pad tokens for video models under sp (sgl-project#14833) * [diffusion] CI: use unified sampling_params for CI (sgl-project#14045) * [Auto Sync] Update tool_chat_template_deepseekv31.jinja (20251210) (sgl-project#14837) Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Jue Wang <zjuwangjue@gmail.com> * Revert transformers to 4.57.1 (sgl-project#14801) * [model-gateway] Fix incompatible metric comparison in` PowerOfTwo` policy (sgl-project#14823) * [bugfix] qwen25-VL support lora (sgl-project#14638) * fix lora target all + csgmv backend (sgl-project#14796) * [model-gateway] adds default implementations to RouterTrait in mod.rs (sgl-project#14841) * [AMD] Add model to AMD nightly test (sgl-project#14442) * Treat unittest SkipTest exception as pass instead of as failure (sgl-project#14847) * [model-gateway] code clean up on oai router (sgl-project#14850) * [model-gateway] fix import order in oai conversation (sgl-project#14851) * fix fp8 gemm nightly CI (sgl-project#14844) Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> * fix: restrict cache validation behaviors to CI only (sgl-project#14849) * Fix CUDA version handling in ci_install_deepep.sh (sgl-project#14854) * Fix TestGLM41VPPAccuracy test flakiness (sgl-project#14848) * Minor code style fix for dllm (sgl-project#14836) * Enable TP for Mamba-based models (sgl-project#14811) Signed-off-by: Roi Koren <roik@nvidia.com> * [CI] Temp disable gb200 test (sgl-project#14865) * Refactor Marlin MoeRunner (sgl-project#14554) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> * [6/n] Fix `num_token_non_padded` computation in prefill (sgl-project#14313) Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: Runkai Tao <rt572@physics.rutger.edu> * Remove myself to test CI gate issue (sgl-project#14871) * fix: creating blobs only once for publish trace retries (sgl-project#14845) * Move and update MindSpore docs, make it appear on the online documentation (sgl-project#14861) Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix nightly vlm ci : restore original eval for requests without regex (sgl-project#14875) * Only count limitations for previous runs that reaches the test stages (sgl-project#14856) * [CI][BUG] fix ib setup for disaggregation hicache test (sgl-project#14877) Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com> * [Fix] Remove unused import from test_disaggregation_hicache.py (sgl-project#14880) * fix: adding temporary bypass for nightly tests (sgl-project#14876) * Avoid deleting entire cache for missing shards (sgl-project#14754 follow-up) (sgl-project#14853) * Tiny add more error info for bench_serving (sgl-project#14827) * Tiny support range ratio in GSP in bench serving (sgl-project#14828) * [diffusion] feat: enable torch compile to eliminate GPU bubble (sgl-project#13641) Co-authored-by: jianyingzhu <53300651@qq.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [NPU] adapt dsv3.2 nsa prefill context parallel (sgl-project#14541) * [diffusion] feat: support sageattn & sageattn3 backend (sgl-project#14878) * dsv32 multistream opt * clean code * delete renormalize in topk * dsv32 use batch_matmul_transpose in MTP * modify comment * Support dynamic w8a8 * dsv3 support ascend_fuseep * rebase modify --------- Signed-off-by: Kay Yan <kay.yan@daocloud.io> Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Brayden Zhong <b8zhong@users.noreply.github.com> Signed-off-by: CLFutureX <chenyongqyl@163.com> Signed-off-by: Kun(llfl) <i@imux.top> Signed-off-by: zhanghaotong <zhanghaotong.zht@antgroup.com> Signed-off-by: Roi Koren <roik@nvidia.com> Signed-off-by: lukotong-7 <shicanwei.scw@alibaba-inc.com> Co-authored-by: Simo Lin <linsimo.mark@gmail.com> Co-authored-by: key4ng <rukeyang@gmail.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Sam <lsam@nvidia.com> Co-authored-by: b8zhong <b8zhong@uwaterloo.ca> Co-authored-by: Brayden Zhong <b8zhong@users.noreply.github.com> Co-authored-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com> Co-authored-by: Kay Yan <kay.yan@daocloud.io> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Yueming Yuan <yy28@illinois.edu> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: sglang-bot <sglangbot@gmail.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: zyksir <zhuyikai.zyk@gmail.com> Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com> Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: GMI Xiao Jin <xiao.j@gmicloud.ai> Co-authored-by: dev <devnull@example.com> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: WenhaoZhang <42087078+niehen6174@users.noreply.github.com> Co-authored-by: niehen6174 <niehen.6174@gmail.com> Co-authored-by: roikoren755 <26850796+roikoren755@users.noreply.github.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com> Co-authored-by: Linda-Stadter <57756729+Linda-Stadter@users.noreply.github.com> Co-authored-by: blahblah <28567807+Brain97@users.noreply.github.com> Co-authored-by: shuxiguo <shuxiguo@meituan.com> Co-authored-by: DefTruth <qiustudent_r@163.com> Co-authored-by: Hudson Xing <77495133+harvenstar@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: Tony Lu <tonylu@linux.alibaba.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Wenyi Xu <wenyixu101@gmail.com> Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Hanming Lu <69857889+hanming-lu@users.noreply.github.com> Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: Rain Jiang <96632942+rainj-me@users.noreply.github.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Feng Su <sufeng@linux.alibaba.com> Co-authored-by: niehen6174 <nihen6174@gmail.com> Co-authored-by: gongwei-130 <56567052+gongwei-130@users.noreply.github.com> Co-authored-by: harrisonlimh <97203667+harrisonlimh@users.noreply.github.com> Co-authored-by: Lee Nau <lnau@nvidia.com> Co-authored-by: almaslof <187766901+almaslof@users.noreply.github.com> Co-authored-by: Rain H <2510421000@qq.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Chen1022 <jincong.cjc@ant-intl.com> Co-authored-by: Tiwei Bie <tiwei.btw@antgroup.com> Co-authored-by: Jinwei Yao <jinweiy@illinois.edu> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: AichenF <aichenf@nvidia.com> Co-authored-by: jianyingzhu <53300651@qq.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: Vladimir Serov <serov.vladimir.zser@gmail.com> Co-authored-by: khalilzhk <khalilzhk@gmail.com> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: wentx <3843588+momaek@users.noreply.github.com> Co-authored-by: Nicholas <45984215+liusy58@users.noreply.github.com> Co-authored-by: Binyao Jiang <byjiang1996@gmail.com> Co-authored-by: yhyang201 <yhyang201@gmail.com> Co-authored-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: Prozac614 <dwt614707404@163.com> Co-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: yctseng0211 <yctseng@amd.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: Jimmy <29097382+jimmy-evo@users.noreply.github.com> Co-authored-by: Even Zhou <even.y.zhou@outlook.com> Co-authored-by: Yineng Zhang <me@zhyncs.com> Co-authored-by: Byron Hsu <byronhsu1230@gmail.com> Co-authored-by: kun-llfl <i@imux.top> Co-authored-by: zhanghaotong <zhanghaotong.zht@antgroup.com> Co-authored-by: yrk111222 <2493404415@qq.com> Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com> Co-authored-by: Douglas Yang <dyang@college.harvard.edu> Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> Co-authored-by: Beichen-Ma <bm685@cornell.edu> Co-authored-by: MingxuZh <109504044+MingxuZh@users.noreply.github.com> Co-authored-by: ShawnY112358 <61113840+ShawnY112358@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: TomerBN-Nvidia <tbarnatan@nvidia.com> Co-authored-by: Peng Zhang <aniz1905@gmail.com> Co-authored-by: Hecate0821 <hec4te0821@gmail.com> Co-authored-by: eternally-z <zzywzj@gmail.com> Co-authored-by: Wilboludriver <wilbolu@outlook.com> Co-authored-by: Wilbolu <81792854+Wilboludriver@users.noreply.github.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com> Co-authored-by: liupeng374 <liupeng374@huawei.com> Co-authored-by: Li Jinliang <975761915@qq.com> Co-authored-by: Liangsheng Yin <lsyincs@gmail.com> Co-authored-by: Jue Wang <zjuwangjue@gmail.com> Co-authored-by: Praneth Paruchuri <pranethparuchuri@gmail.com> Co-authored-by: Siyuan Chen <41201609+SYChen123@users.noreply.github.com> Co-authored-by: michael-amd <Michael.Zhang@amd.com> Co-authored-by: Trang Do <200224632+trangdough@users.noreply.github.com> Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com> Co-authored-by: yuchengz816-bot <yuchengz816@gmail.com> Co-authored-by: Runkai Tao <rt572@physics.rutger.edu> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Tiance Wang <wangtiance@gmail.com> Co-authored-by: wangtiance <tiancew@qq.com> Co-authored-by: shicanwei.scw <shicanwei.scw@alibaba-inc.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: root <root@2u2g-spr-0417.ipp4a1.colossus.nvidia.com> Co-authored-by: liupeng374 <782420244@qq.com>

harvenstar requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners November 28, 2025 08:36

github-actions bot added the blackwell SM100/SM120 label Nov 28, 2025

gemini-code-assist bot reviewed Nov 28, 2025

View reviewed changes

harvenstar and others added 6 commits November 28, 2025 00:45

Merge branch 'main' into fix-mha-fp8-kv-cache

ada63a8

Update python/sglang/srt/models/qwen3_moe.py

1cbb34a

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Address code review feedback from Gemini Code Assist

3eeeb91

- Remove unused FP8_E4M3_MAX constant - Vectorize 4D cache write in naive fallback implementation - Extract duplicated fused path check logic into _should_use_fused_fp8_path method

Merge branch 'sgl-project:main' into fix-mha-fp8-kv-cache

cc242d3

Merge branch 'main' into fix-mha-fp8-kv-cache

2decbf8

github-actions bot added the run-ci label Nov 30, 2025

Qiaolin-Yu self-assigned this Nov 30, 2025

Qiaolin-Yu reviewed Dec 1, 2025

View reviewed changes

Qiaolin-Yu requested a review from FlamingoPg December 2, 2025 00:13

Qiaolin-Yu assigned FlamingoPg Dec 2, 2025

Qiaolin-Yu assigned ispobock Dec 3, 2025

Merge branch 'main' into fix-mha-fp8-kv-cache

620287d

ispobock reviewed Dec 3, 2025

View reviewed changes

github-actions bot added deepseek hicache Hierarchical Caching for SGLang sgl-kernel npu piecewise-cuda-graph diffusion SGLang Diffusion model-gateway labels Dec 4, 2025

harvenstar force-pushed the fix-mha-fp8-kv-cache branch 2 times, most recently from 033b77c to 139b961 Compare December 4, 2025 10:12

harvenstar and others added 3 commits December 4, 2025 10:16

Remove duplicate test file from test/srt directory

74dbca6

Merge branch 'main' into fix-mha-fp8-kv-cache

e3eeac8

Fix lint issues: import order and code formatting

02e0b87

Merge branch 'main' into fix-mha-fp8-kv-cache

60d9b77

harvenstar requested a review from ispobock December 5, 2025 15:32

ispobock merged commit 38daa29 into sgl-project:main Dec 5, 2025
153 of 163 checks passed

Qiaolin-Yu added a commit that referenced this pull request Dec 6, 2025

Revert "Add fused FP8 KV cache write kernel for TRTLLM MHA backend (#…

47d0fae

…14093)" This reverts commit 38daa29.

Qiaolin-Yu mentioned this pull request Dec 6, 2025

Revert "Add fused FP8 KV cache write kernel for TRTLLM MHA backend" #14550

Closed

harvenstar mentioned this pull request Dec 6, 2025

Fix FP8 KV Triton type issue and add regression test #14553

Merged

Qiaolin-Yu added a commit to Qiaolin-Yu/sglang that referenced this pull request Dec 7, 2025

Revert "Add fused FP8 KV cache write kernel for TRTLLM MHA backend (s…

f8794d0

…gl-project#14093)" This reverts commit 38daa29.

yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025

Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-proje…

82173c7

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

eternally-z pushed a commit to AniZpZ/sglang that referenced this pull request Dec 8, 2025

Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-proje…

25c83d2

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025

Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-proje…

ed09e68

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

dcampora pushed a commit to dcampora/sglang that referenced this pull request Dec 15, 2025

Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-proje…

5090a22

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Add fused FP8 KV cache write kernel for TRTLLM MHA backend (sgl-proje…

d8ea6bf

…ct#14093) Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

Conversation

harvenstar commented Nov 28, 2025

Summary

Motivation

Changes

New fused FP8 KV kernel

Integration into TRTLLM attention backend

Accuracy and Testing

Unit tests

End to end evaluation (GSM8K)

Benchmarking and Profiling

Uh oh!

gemini-code-assist bot commented Nov 28, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Qiaolin-Yu commented Nov 30, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harvenstar Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ispobock left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harvenstar commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Throughput Improvement Summary

2. Full Benchmark Results and Profile Figure (For Reference)

Uh oh!

ispobock commented Dec 4, 2025

Uh oh!

harvenstar commented Dec 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

harvenstar Dec 2, 2025 •

edited

Loading

harvenstar commented Dec 3, 2025 •

edited

Loading