@gty111 gty111 commented Jul 18, 2025

Reference: vllm-project/vllm#17080

This PR pipelines the prefill chunks of a single request across pipeline-parallel stages, so a later chunk can enter the first stage while earlier chunks are still in flight. This greatly lowers latency for long text inputs.
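The latency win follows from standard pipeline fill/drain arithmetic: if one request's prompt is split into chunks, running them back-to-back keeps all PP stages busy instead of draining the pipeline between chunks. A minimal sketch of the step count (the function name and chunking model are illustrative, not this PR's actual code):

```python
def prefill_steps(num_chunks: int, num_stages: int, pipelined: bool) -> int:
    """Pipeline steps to finish prefill of one request split into chunks.

    Without pipelining, each chunk must traverse all stages before the next
    chunk is scheduled; with pipelining, successive chunks follow one stage
    apart, so the pipeline stays full after the initial fill.
    """
    if pipelined:
        return num_chunks + num_stages - 1  # classic pipeline fill + drain
    return num_chunks * num_stages          # chunks run strictly one-by-one

# Example: a long prompt split into 8 chunks on a PP=4 pipeline
print(prefill_steps(8, 4, pipelined=False))  # 32 steps
print(prefill_steps(8, 4, pipelined=True))   # 11 steps
```

With PP=4 and 8 chunks this cuts prefill from 32 pipeline steps to 11, which is consistent with the large TTFT reduction measured below.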

Tested on 4 x RTX 4090 GPUs, 32 requests at 0.5 req/s, PP=4, TP=1.

Before this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  94.16     
Total input tokens:                      314732    
Total generated tokens:                  8411      
Request throughput (req/s):              0.34      
Output token throughput (tok/s):         89.32     
Total Token throughput (tok/s):          3431.73   
---------------Time to First Token----------------
Mean TTFT (ms):                          4374.34   
Median TTFT (ms):                        4177.95   
P99 TTFT (ms):                           6679.84   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          75.85     
Median TPOT (ms):                        75.12     
P99 TPOT (ms):                           185.22    
---------------Inter-token Latency----------------
Mean ITL (ms):                           88.82     
Median ITL (ms):                         42.95     
P99 ITL (ms):                            344.36    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          22068.80  
Median E2EL (ms):                        20669.61  
P99 E2EL (ms):                           46538.70  
==================================================

After this PR

============ Serving Benchmark Result ============
Successful requests:                     32        
Benchmark duration (s):                  91.59     
Total input tokens:                      314732    
Total generated tokens:                  8449      
Request throughput (req/s):              0.35      
Output token throughput (tok/s):         92.25     
Total Token throughput (tok/s):          3528.74   
---------------Time to First Token----------------
Mean TTFT (ms):                          1200.22   
Median TTFT (ms):                        1086.74   
P99 TTFT (ms):                           2277.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          52.04     
Median TPOT (ms):                        52.14     
P99 TPOT (ms):                           90.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           57.49     
Median ITL (ms):                         39.60     
P99 ITL (ms):                            376.58    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          14306.62  
Median E2EL (ms):                        13230.37  
P99 E2EL (ms):                           32460.10  
==================================================
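To summarize the comparison, the improvement ratios can be computed directly from the mean values reported in the two runs above:

```python
# Improvement ratios from the benchmark means reported above (all in ms)
before = {"TTFT": 4374.34, "TPOT": 75.85, "E2EL": 22068.80}
after = {"TTFT": 1200.22, "TPOT": 52.04, "E2EL": 14306.62}

for metric in before:
    ratio = before[metric] / after[metric]
    print(f"Mean {metric}: {ratio:.2f}x lower")
# Mean TTFT: 3.64x lower
# Mean TPOT: 1.46x lower
# Mean E2EL: 1.54x lower
```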

@gty111 gty111 merged commit 1293c7e into master Jul 18, 2025
2 checks passed
@gty111 gty111 deleted the continue_prefill branch July 18, 2025 08:04