[DSA] Support trtllm sparse mla kernel for prefill batches #21783

Merged
Fridge003 merged 5 commits into main from trtllm-prefill-nsa
Apr 1, 2026

@Fridge003
Collaborator

Motivation

Depends on flashinfer v0.6.7 #21422

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@Fridge003 Fridge003 marked this pull request as draft March 31, 2026 22:48
@Fridge003 Fridge003 marked this pull request as ready for review April 1, 2026 05:58

@Fridge003
Collaborator Author

Fridge003 commented Apr 1, 2026

Benchmark results

Model: GLM-5 fp8 on B200, TP8
Baseline: fp8 kv cache + flashmla_sparse prefill + flashmla_kv decode

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend flashmla_sparse --nsa-decode-backend flashmla_kv --kv-cache-dtype fp8_e4m3

PR: fp8 kv cache + trtllm prefill + trtllm decode

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend trtllm --nsa-decode-backend trtllm --kv-cache-dtype fp8_e4m3
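The two launch commands above are long and nearly identical, so it can help to confirm mechanically which flags differ. The following is a minimal sketch that parses both command lines (copied verbatim from this comment) and prints only the flags whose values changed. The helper names `parse_flags` and `diff_flags` are hypothetical and exist only for this illustration; they are not part of sglang.

```python
import shlex

# Command lines copied verbatim from the benchmark comment.
BASELINE = (
    "sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code "
    "--host 127.0.0.1 --port 30000 --nsa-prefill-backend flashmla_sparse "
    "--nsa-decode-backend flashmla_kv --kv-cache-dtype fp8_e4m3"
)
PR = (
    "sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --trust-remote-code "
    "--host 127.0.0.1 --port 30000 --nsa-prefill-backend trtllm "
    "--nsa-decode-backend trtllm --kv-cache-dtype fp8_e4m3"
)


def parse_flags(cmd):
    """Hypothetical helper: map each --flag to its value (True for bare switches)."""
    tokens = shlex.split(cmd)
    flags = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[tok] = tokens[i + 1] if has_value else True
            i += 2 if has_value else 1
        else:
            i += 1  # skip positional tokens such as "sglang" and "serve"
    return flags


def diff_flags(old, new):
    """Hypothetical helper: return {flag: (old_value, new_value)} for differing flags."""
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


changed = diff_flags(parse_flags(BASELINE), parse_flags(PR))
for flag, (old_value, new_value) in changed.items():
    print(f"{flag}: {old_value} -> {new_value}")
```

Only `--nsa-prefill-backend` and `--nsa-decode-backend` differ between the two runs; the model, parallelism, and KV cache dtype are identical.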

Send-One

Baseline:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.062    |  512   |   1.000    |      63.51      |
+-------------+--------+------------+-----------------+

PR:

+-------------+--------+------------+-----------------+
| Latency (s) | Tokens | Acc Length | Speed (token/s) |
+-------------+--------+------------+-----------------+
|    8.041    |  512   |   1.000    |      63.67      |
+-------------+--------+------------+-----------------+

1k-isl-1k-osl-16-concurrency

python3 -m sglang.bench_serving --backend sglang --num-prompts 64 --dataset-name random --random-input 1024 --random-output 1024 --random-range-ratio 1.0 --max-concurrency 16

Baseline:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     64        
Benchmark duration (s):                  89.52     
Total input tokens:                      65536     
Total input text tokens:                 65536     
Total generated tokens:                  65536     
Total generated tokens (retokenized):    65523     
Request throughput (req/s):              0.71      
Input token throughput (tok/s):          732.08    
Output token throughput (tok/s):         732.08    
Peak output token throughput (tok/s):    768.00    
Peak concurrent requests:                32        
Total token throughput (tok/s):          1464.16   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22371.28  
Median E2E Latency (ms):                 22367.17  
P90 E2E Latency (ms):                    22500.67  
P99 E2E Latency (ms):                    22501.75  
---------------Time to First Token----------------
Mean TTFT (ms):                          702.87    
Median TTFT (ms):                        717.30    
P99 TTFT (ms):                           747.08    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.18     
Median TPOT (ms):                        21.24     
P99 TPOT (ms):                           21.41     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.18     
Median ITL (ms):                         21.16     
P95 ITL (ms):                            21.71     
P99 ITL (ms):                            22.17     
Max ITL (ms):                            163.60    
==================================================

PR:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     64        
Benchmark duration (s):                  89.26     
Total input tokens:                      65536     
Total input text tokens:                 65536     
Total generated tokens:                  65536     
Total generated tokens (retokenized):    65511     
Request throughput (req/s):              0.72      
Input token throughput (tok/s):          734.18    
Output token throughput (tok/s):         734.18    
Peak output token throughput (tok/s):    784.00    
Peak concurrent requests:                32        
Total token throughput (tok/s):          1468.36   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   22307.50  
Median E2E Latency (ms):                 22355.14  
P90 E2E Latency (ms):                    22434.27  
P99 E2E Latency (ms):                    22435.01  
---------------Time to First Token----------------
Mean TTFT (ms):                          718.44    
Median TTFT (ms):                        722.77    
P99 TTFT (ms):                           762.38    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          21.10     
Median TPOT (ms):                        21.20     
P99 TPOT (ms):                           21.35     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           21.10     
Median ITL (ms):                         21.10     
P95 ITL (ms):                            21.66     
P99 ITL (ms):                            21.93     
Max ITL (ms):                            165.95    
==================================================
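To make the two serving reports easier to compare, the sketch below computes the relative change for a handful of metrics. The numbers are transcribed by hand from the baseline and PR reports above; the selection of metrics and the helper `pct_change` are choices made for this illustration, not output of `sglang.bench_serving`.

```python
# Values transcribed from the two "Serving Benchmark Result" reports above.
baseline = {
    "Benchmark duration (s)": 89.52,
    "Total token throughput (tok/s)": 1464.16,
    "Mean TTFT (ms)": 702.87,
    "Mean TPOT (ms)": 21.18,
    "P99 ITL (ms)": 22.17,
}
pr = {
    "Benchmark duration (s)": 89.26,
    "Total token throughput (tok/s)": 1468.36,
    "Mean TTFT (ms)": 718.44,
    "Mean TPOT (ms)": 21.10,
    "P99 ITL (ms)": 21.93,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(baseline[name], pr[name]) for name in baseline}

for name, delta in changes.items():
    print(f"{name:32s} {baseline[name]:>9.2f} -> {pr[name]:>9.2f} ({delta:+.2f}%)")
```

On these numbers, mean TTFT is about 2.2% higher with the PR, while duration, throughput, TPOT, and P99 ITL each move by roughly 1% or less. This is a single run per configuration, so the sketch says nothing about run-to-run variance.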

@Fridge003
Collaborator Author

Fridge003 commented Apr 1, 2026

Accuracy Results

Launch:

sglang serve --model-path zai-org/GLM-5-FP8 --tp 8 --dp 8 --enable-dp-attention --trust-remote-code --host 127.0.0.1 --port 30000 --nsa-prefill-backend trtllm --nsa-decode-backend trtllm --kv-cache-dtype fp8_e4m3 --reasoning-parser glm45 --mem-fraction-static 0.85

GSM8k-20 shots:

python3 benchmark/gsm8k/bench_sglang.py --num-shots 20 --num-questions 1319 --parallel 1319

Accuracy: 0.951
Invalid: 0.000
Latency: 50.628 s
Output throughput: 2715.272 token/s

GPQA:

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode glm-45

Repeat: 8, mean: 0.849
Scores: ['0.854', '0.854', '0.869', '0.838', '0.838', '0.848', '0.823', '0.869'] 
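As a quick sanity check on the GPQA summary, the reported mean can be recomputed from the eight per-repeat scores listed above. The sketch below uses only the Python standard library; the sample standard deviation and range are extra statistics added for this illustration and are not reported by `sglang.test.run_eval` in this comment.

```python
import statistics

# Per-repeat GPQA scores copied from the "Scores" line above.
scores = [0.854, 0.854, 0.869, 0.838, 0.838, 0.848, 0.823, 0.869]

mean_score = statistics.mean(scores)
stdev_score = statistics.stdev(scores)  # sample standard deviation across repeats

print(f"repeats: {len(scores)}")
print(f"mean:    {mean_score:.3f}")
print(f"stdev:   {stdev_score:.3f}")
print(f"range:   {min(scores):.3f} to {max(scores):.3f}")
```

The recomputed mean rounds to 0.849, matching the reported value.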

@Fridge003
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@Fridge003
Collaborator Author

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions bot commented Apr 1, 2026

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies). View workflow run

@Fridge003 Fridge003 merged commit 5e12c4e into main Apr 1, 2026
101 of 111 checks passed
@Fridge003 Fridge003 deleted the trtllm-prefill-nsa branch April 1, 2026 20:55