[DSA] Support trtllm sparse mla kernel for prefill batches by Fridge003 · Pull Request #21783 · sgl-project/sglang

Fridge003 · 2026-03-31T22:48:04Z

Motivation

Depends on flashinfer v0.6.7 #21422

Modifications

Accuracy Tests

Speed Tests and Profiling


Fridge003 · 2026-04-01T06:09:12Z

Benchmark results

Model: GLM-5 fp8 on B200, TP8
Baseline: fp8 kv cache + flashmla_sparse prefill + flashmla_kv decode

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv --kv-cache-dtype fp8_e4m3

PR: fp8 kv cache + trtllm prefill + trtllm decode

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend trtllm --nsa-decode-backend trtllm --kv-cache-dtype fp8_e4m3

Send-One

Baseline:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.062    |  512   |   1.000    |      63.51      |
+-------------+--------+------------+-----------------+

PR:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.041    |  512   |   1.000    |      63.67      |
+-------------+--------+------------+-----------------+
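
As a quick sanity check on the two Send-One tables, the reported speed should equal tokens divided by latency. The sketch below only redoes that arithmetic on the posted numbers; the dictionary layout and variable names are illustrative and not part of sglang.

```python
# Send-One figures copied from the tables above: (latency in s, tokens generated).
runs = {"baseline": (8.062, 512), "pr": (8.041, 512)}

# Speed (token/s) = tokens / latency.
speeds = {name: tokens / latency for name, (latency, tokens) in runs.items()}
for name, speed in speeds.items():
    print(f"{name}: {speed:.2f} token/s")

# Relative single-request speedup of the PR over the baseline, in percent.
speedup_pct = (speeds["pr"] / speeds["baseline"] - 1) * 100
print(f"speedup: {speedup_pct:.2f}%")
```

The difference is a fraction of a percent, which is expected for a single 512-token request where decode dominates and prefill is a small part of the total latency.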

1k-isl-1k-osl-16-concurrency

python3 -m sglang.bench_serving --backend sglang --num-prompts 64 --dataset-name random --random-input 1024 --random-output 1024 --random-range-ratio 1.0 --max-concurrency 16

Baseline:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     64        
Benchmark duration (s):                  89.52     
Total input tokens:                      65536     
Total input text tokens:                 65536     
Total generated tokens:                  65536     
Total generated tokens (retokenized):    65523     
Request throughput (req/s):              0.71      
Input token throughput (tok/s):          732.08    
Output token throughput (tok/s):         732.08    
Peak output token throughput (tok/s):    768.00    
Peak concurrent requests:                32        
Total token throughput (tok/s):          1464.16   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22371.28  
Median E2E Latency (ms):                 22367.17  
P90 E2E Latency (ms):                    22500.67  
P99 E2E Latency (ms):                    22501.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          702.87    
Median TTFT (ms):                        717.30    
P99 TTFT (ms):                           747.08    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.18     
Median TPOT (ms):                        21.24     
P99 TPOT (ms):                           21.41     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.18     
Median ITL (ms):                         21.16     
P95 ITL (ms):                            21.71     
P99 ITL (ms):                            22.17     
Max ITL (ms):                            163.60    
==================================================

PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     64        
Benchmark duration (s):                  89.26     
Total input tokens:                      65536     
Total input text tokens:                 65536     
Total generated tokens:                  65536     
Total generated tokens (retokenized):    65511     
Request throughput (req/s):              0.72      
Input token throughput (tok/s):          734.18    
Output token throughput (tok/s):         734.18    
Peak output token throughput (tok/s):    784.00    
Peak concurrent requests:                32        
Total token throughput (tok/s):          1468.36   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22307.50  
Median E2E Latency (ms):                 22355.14  
P90 E2E Latency (ms):                    22434.27  
P99 E2E Latency (ms):                    22435.01  
---------------Time to First Token----------------
Mean TTFT (ms):                          718.44    
Median TTFT (ms):                        722.77    
P99 TTFT (ms):                           762.38    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.10     
Median TPOT (ms):                        21.20     
P99 TPOT (ms):                           21.35     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.10     
Median ITL (ms):                         21.10     
P95 ITL (ms):                            21.66     
P99 ITL (ms):                            21.93     
Max ITL (ms):                            165.95    
==================================================
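
To make the two serving runs easier to compare, the percentage change of a few headline metrics can be computed directly from the reports above. This is a minimal sketch over the posted numbers only; `pct_change` is a hypothetical helper, and a single run per configuration says nothing about run-to-run noise.

```python
# (baseline, PR) values copied from the two benchmark reports above.
metrics = {
    "Output token throughput (tok/s)": (732.08, 734.18),
    "Mean E2E Latency (ms)": (22371.28, 22307.50),
    "Mean TTFT (ms)": (702.87, 718.44),
    "Mean TPOT (ms)": (21.18, 21.10),
}


def pct_change(baseline: float, pr: float) -> float:
    """Relative change of the PR value versus the baseline, in percent."""
    return (pr - baseline) / baseline * 100


deltas = {name: pct_change(b, p) for name, (b, p) in metrics.items()}
for name, delta in deltas.items():
    print(f"{name:<34} {delta:+.2f}%")
```

On these numbers, throughput, end-to-end latency, and TPOT each move by well under 1%, while mean TTFT is about 2% higher with the trtllm backends.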

Fridge003 · 2026-04-01T06:40:14Z

Accuracy Results

Launch:

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --dp 8 --enable-dp-attention --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend trtllm --nsa-decode-backend trtllm --kv-cache-dtype fp8_e4m3 --reasoning-parser glm45 --mem-fraction-static 0.85

GSM8K, 20-shot:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

Accuracy: 0.951
Invalid: 0.000
Latency: 50.628 s
Output throughput: 2715.272 token/s

GPQA:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode glm-45

Repeat: 8, mean: 0.849
Scores: ['0.854', '0.854', '0.869', '0.838', '0.838', '0.848', '0.823', '0.869']
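
The reported GPQA mean can be reproduced from the eight per-repeat scores, and their spread gives a rough idea of how much a single repeat can vary. A small sketch using only the Python standard library; the variable names are illustrative.

```python
import statistics

# Per-repeat GPQA scores copied from the output above.
scores = [0.854, 0.854, 0.869, 0.838, 0.838, 0.848, 0.823, 0.869]

mean = statistics.mean(scores)
# Sample standard deviation across the 8 repeats.
stdev = statistics.stdev(scores)

print(f"mean:  {mean:.3f}")
print(f"stdev: {stdev:.3f}")
print(f"range: {min(scores):.3f} to {max(scores):.3f}")
```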

Fridge003 · 2026-04-01T19:21:43Z

/rerun-stage stage-c-test-8-gpu-h200

Fridge003 · 2026-04-01T19:21:52Z

/rerun-stage stage-c-test-4-gpu-b200

github-actions · 2026-04-01T19:22:10Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

github-actions · 2026-04-01T19:22:21Z

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

Fridge003 added 2 commits March 31, 2026 13:43

upd

52f2f2a

upd

83e21ba

Fridge003 requested review from HaiShaw, Qiaolin-Yu, hebiao064, ispobock and merrymercy as code owners March 31, 2026 22:48

Fridge003 marked this pull request as draft March 31, 2026 22:48

Fridge003 mentioned this pull request Mar 31, 2026

[Roadmap] DeepSeek v3.2 (GLM 5) Optimization #15025

Open

39 tasks

Fridge003 marked this pull request as ready for review April 1, 2026 05:58

Merge remote-tracking branch 'origin/main' into trtllm-prefill-nsa

870f9b9

Fridge003 added 2 commits March 31, 2026 23:53

upd

766c5e4

Merge branch 'main' into trtllm-prefill-nsa

9505e67

Fridge003 merged commit 5e12c4e into main Apr 1, 2026
101 of 111 checks passed

Fridge003 deleted the trtllm-prefill-nsa branch April 1, 2026 20:55

Fridge003 merged 5 commits into main from trtllm-prefill-nsa
