@gty111 gty111 commented Jul 18, 2025

Reference: vllm-project/vllm#17080

This PR pipelines the prefill chunks of a single request across pipeline-parallel stages, so a later chunk can enter the first stage while earlier chunks are still in flight. This greatly lowers latency for long text inputs.
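The latency win follows from standard pipeline fill/drain arithmetic: if one request's prompt is split into chunks, running them back-to-back keeps all PP stages busy instead of draining the pipeline between chunks. A minimal sketch of the step count (the function name and chunking model are illustrative, not this PR's actual code):

```python
def prefill_steps(num_chunks: int, num_stages: int, pipelined: bool) -> int:
    """Pipeline steps to finish prefill of one request split into chunks.

    Without pipelining, each chunk must traverse all stages before the next
    chunk is scheduled; with pipelining, successive chunks follow one stage
    apart, so the pipeline stays full after the initial fill.
    """
    if pipelined:
        return num_chunks + num_stages - 1  # classic pipeline fill + drain
    return num_chunks * num_stages          # chunks run strictly one-by-one

# Example: a long prompt split into 8 chunks on a PP=4 pipeline
print(prefill_steps(8, 4, pipelined=False))  # 32 steps
print(prefill_steps(8, 4, pipelined=True))   # 11 steps
```

With PP=4 and 8 chunks this cuts prefill from 32 pipeline steps to 11, which is consistent with the large TTFT reduction measured below.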

Tested on 4 x RTX 4090 GPUs, 32 requests at 0.5 req/s, PP=4, TP=1.

Before this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  94.16     
Total input tokens:                      314732    
Total generated tokens:                  8411      
Request throughput (req/s):              0.34      
Output token throughput (tok/s):         89.32     
Total Token throughput (tok/s):          3431.73   
---------------Time to First Token----------------
Mean TTFT (ms):                          4374.34   
Median TTFT (ms):                        4177.95   
P99 TTFT (ms):                           6679.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          75.85     
Median TPOT (ms):                        75.12     
P99 TPOT (ms):                           185.22    
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.82     
Median ITL (ms):                         42.95     
P99 ITL (ms):                            344.36    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22068.80  
Median E2EL (ms):                        20669.61  
P99 E2EL (ms):                           46538.70  
==================================================

After this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  91.59     
Total input tokens:                      314732    
Total generated tokens:                  8449      
Request throughput (req/s):              0.35      
Output token throughput (tok/s):         92.25     
Total Token throughput (tok/s):          3528.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          1200.22   
Median TTFT (ms):                        1086.74   
P99 TTFT (ms):                           2277.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.04     
Median TPOT (ms):                        52.14     
P99 TPOT (ms):                           90.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.49     
Median ITL (ms):                         39.60     
P99 ITL (ms):                            376.58    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14306.62  
Median E2EL (ms):                        13230.37  
P99 E2EL (ms):                           32460.10  
==================================================
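To summarize the comparison, the improvement ratios can be computed directly from the mean values reported in the two runs above:

```python
# Improvement ratios from the benchmark means reported above (all in ms)
before = {"TTFT": 4374.34, "TPOT": 75.85, "E2EL": 22068.80}
after = {"TTFT": 1200.22, "TPOT": 52.04, "E2EL": 14306.62}

for metric in before:
    ratio = before[metric] / after[metric]
    print(f"Mean {metric}: {ratio:.2f}x lower")
# Mean TTFT: 3.64x lower
# Mean TPOT: 1.46x lower
# Mean E2EL: 1.54x lower
```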

@gty111 gty111 merged commit 1293c7e into master Jul 18, 2025
2 checks passed
@gty111 gty111 deleted the continue_prefill branch July 18, 2025 08:04