Encoder-Prefill-Decode (EPD) Disaggregation by yhyang201 · Pull Request #274 · lm-sys/lm-sys.github.io

yhyang201 · 2025-12-15T03:48:50Z

No description provided.

merrymercy · 2025-12-26T03:33:21Z

blog/2026-01-12-epd.md

+    --backend vllm-chat \
+    --request-rate $request_rate 
+```
+Mean TTFT (EPD vs colocate)


Describe the setup more precisely (e.g., how many GPUs each setting uses)


gty111 · 2025-12-28T08:29:47Z

EPD disaggregation can be advantageous in the single-request, multi-image scenario, where it benefits both latency and throughput.
Here are some benchmark results that use the same number of GPUs for the colocate and EPD disaggregation setups.
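
To make the GPU accounting behind the layouts in this comment explicit, here is a minimal sketch. It assumes each tensor-parallel rank occupies exactly one GPU, which the comment does not state; the `Group` class and `total_gpus` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Group:
    """One set of identical server instances, e.g. "E (TP1) x 4"."""
    role: str      # "colocate", "E" (encoder) or "PD" (prefill + decode)
    tp: int        # tensor-parallel size of one instance
    replicas: int  # number of instances

    @property
    def gpus(self) -> int:
        # Assumption: one GPU per tensor-parallel rank.
        return self.tp * self.replicas


def total_gpus(groups):
    """Sum the GPUs used by every group in a layout."""
    return sum(g.gpus for g in groups)


# Layouts exactly as named in the benchmark headings.
layouts = {
    "Qwen3-VL-30B-A3B colocate": [Group("colocate", 8, 1)],
    "Qwen3-VL-30B-A3B EPD": [Group("E", 1, 4), Group("PD", 4, 1)],
    "Qwen2.5-7B-VL colocate": [Group("colocate", 4, 1)],
    "Qwen2.5-7B-VL EPD": [Group("E", 1, 3), Group("PD", 1, 1)],
}

for name, groups in layouts.items():
    print(f"{name}: {total_gpus(groups)} GPUs")
```

Under that assumption, both Qwen3-VL-30B-A3B layouts use 8 GPUs and both Qwen2.5-7B-VL layouts use 4, matching the "H100 x 8" and "H100 x 4" headings.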

Qwen3-VL-30B-A3B, H100 x 8

colocate TP8 x 1

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.10     
Total input tokens:                      379190    
Total input text tokens:                 72502     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.89   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5333.17   
Concurrency:                             1.29      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2865.13   
Median E2E Latency (ms):                 2530.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1769.44   
Median TTFT (ms):                        1699.01   
P99 TTFT (ms):                           6265.33   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

E (TP1) x 4 + PD (TP4) x 1

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.12     
Total input tokens:                      379258    
Total input text tokens:                 72570     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.67   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5332.95   
Concurrency:                             0.66      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1468.98   
Median E2E Latency (ms):                 1404.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          960.22    
Median TTFT (ms):                        1052.19   
P99 TTFT (ms):                           2907.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
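
As a sanity check on the two Qwen3-VL-30B-A3B runs above, the reported "Concurrency" values appear consistent with Little's law (average requests in flight = completed requests x mean E2E latency / benchmark duration). The sketch below only plugs in the numbers printed above; `implied_concurrency` is a hypothetical helper, and this is an observation about the figures, not a statement of how the benchmark tool computes the field.

```python
def implied_concurrency(n_requests: int, mean_e2e_ms: float, duration_s: float) -> float:
    """Average number of in-flight requests implied by Little's law."""
    total_busy_s = n_requests * mean_e2e_ms / 1000.0  # summed request time in seconds
    return total_busy_s / duration_s


# Values copied from the two result blocks above.
colocate = implied_concurrency(32, 2865.13, 71.10)  # reported Concurrency: 1.29
epd = implied_concurrency(32, 1468.98, 71.12)       # reported Concurrency: 0.66

print(f"colocate: {colocate:.2f}")
print(f"EPD:      {epd:.2f}")
```

Because the request rate is fixed at 0.5 req/s in both runs, the request throughput is identical and the lower concurrency for EPD simply reflects its shorter per-request latency.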

Qwen2.5-7B-VL, H100 x 4

colocate TP4 x 1

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.81     
Total input tokens:                      244968    
Total input text tokens:                 67664     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.79   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.98   
Concurrency:                             0.40      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1191.23   
Median E2E Latency (ms):                 1187.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          718.54    
Median TTFT (ms):                        717.21    
P99 TTFT (ms):                           2108.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

E (TP1) x 3 + PD (TP1) x 1

#Input tokens: 244991
#Output tokens: 18
#Total images: 148
#Images per request: min=1, max=8, mean=4.62

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.82     
Total input tokens:                      244991    
Total input text tokens:                 67687     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.73   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.91   
Concurrency:                             0.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.72    
Median E2E Latency (ms):                 611.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          378.07    
Median TTFT (ms):                        465.47    
P99 TTFT (ms):                           993.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
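
The latency gap between the two setups can be summarized as a ratio of colocate to EPD. The sketch below uses only the figures reported in the four result blocks of this comment; the dictionary layout and the `speedup` helper are illustrative choices, not part of the benchmark tooling.

```python
# Latencies in milliseconds, copied from the benchmark output above.
results = {
    "Qwen3-VL-30B-A3B": {
        "colocate": {"mean_ttft": 1769.44, "p99_ttft": 6265.33, "mean_e2e": 2865.13},
        "epd":      {"mean_ttft": 960.22,  "p99_ttft": 2907.76, "mean_e2e": 1468.98},
    },
    "Qwen2.5-7B-VL": {
        "colocate": {"mean_ttft": 718.54, "p99_ttft": 2108.17, "mean_e2e": 1191.23},
        "epd":      {"mean_ttft": 378.07, "p99_ttft": 993.11,  "mean_e2e": 609.72},
    },
}


def speedup(model: str, metric: str) -> float:
    """Colocate latency divided by EPD latency (values above 1 favor EPD)."""
    runs = results[model]
    return runs["colocate"][metric] / runs["epd"][metric]


for model in results:
    for metric in ("mean_ttft", "p99_ttft", "mean_e2e"):
        print(f"{model:18s} {metric:10s} {speedup(model, metric):.2f}x")
```

On these runs, mean TTFT and mean E2E latency are roughly 1.8x to 2x lower with EPD, and P99 TTFT is a little over 2x lower, at the same GPU count and the same request rate of 0.5 req/s.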

yhyang201 and others added 16 commits December 14, 2025 16:45

upd

9af1ac0

upd

b4666b1

add

6bee7b6

add Acknowledgment

c80c4cd

upd

cd13f9f

fix

aa21044

fix

c5b2920

fix

7fb5c85

upd

625216d

upd

a315d3d

upd

9032d81

fix

32b8798

fix

0bf47f6

add note

2d4076c

fix

389a75a

fix

7b28cde

yhyang201 changed the title ~~[WIP] Encoder-Prefill-Decode (EPD) Disaggregation~~ Encoder-Prefill-Decode (EPD) Disaggregation Dec 16, 2025

Delete blog/Untitled

be9db76

merrymercy reviewed Dec 26, 2025


fix

9719c16

gty111 reviewed Dec 28, 2025


liusy58 added 4 commits December 29, 2025 11:33

delete

6b1e256

fix

78ad192

fix

4000fed

fix

a8de739

wisclmy0611 merged commit 31eb7e7 into lm-sys:main Jan 12, 2026

wisclmy0611 merged 22 commits into lm-sys:main from yhyang201:epd


5 participants
