Encoder-Prefill-Decode (EPD) Disaggregation #274

Merged
wisclmy0611 merged 22 commits into lm-sys:main from yhyang201:epd on Jan 12, 2026

@yhyang201
Contributor

No description provided.

@yhyang201 changed the title from "[WIP] Encoder-Prefill-Decode (EPD) Disaggregation" to "Encoder-Prefill-Decode (EPD) Disaggregation" on Dec 16, 2025
--backend vllm-chat \
--request-rate $request_rate
```
Mean TTFT (EPD vs colocate)
Member

Describe the setup more precisely (e.g., how many GPUs each setting uses)

Contributor

done

@gty111
Contributor

gty111 commented Dec 28, 2025

EPD disaggregation can be advantageous in the single-request, multi-image scenario. It benefits both latency and throughput.
Here are some benchmark results that keep the same number of GPUs between colocate and EPD disaggregation.
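
The equal-GPU claim can be checked directly from the configuration labels used in this comment. The sketch below is only illustrative: it assumes each tensor-parallel rank occupies one GPU (not stated in the thread), and `gpu_count` is a hypothetical helper, not part of the benchmark tooling.

```python
import re

# Assumption (not stated in the thread): each tensor-parallel rank uses one GPU.
# Matches groups such as "TP8 x 1" or "(TP1) x 4" inside a configuration label.
_TP_GROUP = re.compile(r"TP(\d+)\)?\s*x\s*(\d+)")


def gpu_count(label: str) -> int:
    """Sum tp_size * replicas over every 'TP<n> x <m>' group in a label."""
    return sum(int(tp) * int(replicas) for tp, replicas in _TP_GROUP.findall(label))


# Labels copied verbatim from this comment, grouped by the hardware they ran on.
labels = {
    "H100 x 8": ["colocate TP8 x 1", "E (TP1) x 4 + PD (TP4) x 1"],
    "H100 x 4": ["colocate TP4 x 1", "E (TP1) x 3 + PD (TP1) x 1"],
}

counts = {hw: [gpu_count(label) for label in group] for hw, group in labels.items()}
print(counts)
```

Under that assumption, both settings in each pair add up to the same number of GPUs as the hardware line states.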

Qwen3-VL-30B-A3B, H100 x 8

  • colocate TP8 x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.10     
Total input tokens:                      379190    
Total input text tokens:                 72502     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.89   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5333.17   
Concurrency:                             1.29      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2865.13   
Median E2E Latency (ms):                 2530.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          1769.44   
Median TTFT (ms):                        1699.01   
P99 TTFT (ms):                           6265.33   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • E (TP1) x 4 + PD (TP4) x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  71.12     
Total input tokens:                      379258    
Total input text tokens:                 72570     
Total input vision tokens:               306688    
Total generated tokens:                  20        
Total generated tokens (retokenized):    20        
Request throughput (req/s):              0.45      
Input token throughput (tok/s):          5332.67   
Output token throughput (tok/s):         0.28      
Total token throughput (tok/s):          5332.95   
Concurrency:                             0.66      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1468.98   
Median E2E Latency (ms):                 1404.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          960.22    
Median TTFT (ms):                        1052.19   
P99 TTFT (ms):                           2907.76   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
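
To make the Qwen3-VL-30B-A3B comparison easier to read, the latency rows of the two reports above can be reduced to ratios. This is a small sketch using only the numbers printed in those reports; the dictionary keys and helper names are made up for illustration.

```python
# Values copied from the two Qwen3-VL-30B-A3B reports above (milliseconds).
colocate = {"mean_ttft": 1769.44, "median_ttft": 1699.01,
            "p99_ttft": 6265.33, "mean_e2e": 2865.13}
epd = {"mean_ttft": 960.22, "median_ttft": 1052.19,
       "p99_ttft": 2907.76, "mean_e2e": 1468.98}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times lower the candidate latency is than the baseline."""
    return baseline_ms / candidate_ms


def reduction_pct(baseline_ms: float, candidate_ms: float) -> float:
    """Latency reduction relative to the baseline, in percent."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


summary = {
    key: (speedup(colocate[key], epd[key]), reduction_pct(colocate[key], epd[key]))
    for key in colocate
}
for key, (ratio, pct) in summary.items():
    print(f"{key:12s} {ratio:.2f}x lower, {pct:.1f}% reduction")
```

With these inputs, mean TTFT comes out at about 1.84x lower and mean E2E latency at about 1.95x lower for the EPD setting, while request throughput is identical in both runs (0.45 req/s at a request rate of 0.5).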

Qwen2.5-7B-VL, H100 x 4

  • colocate TP4 x 1
============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.81     
Total input tokens:                      244968    
Total input text tokens:                 67664     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.79   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.98   
Concurrency:                             0.40      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1191.23   
Median E2E Latency (ms):                 1187.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          718.54    
Median TTFT (ms):                        717.21    
P99 TTFT (ms):                           2108.17   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
  • E (TP1) x 3 + PD (TP1) x 1
#Input tokens: 244991
#Output tokens: 18
#Total images: 148
#Images per request: min=1, max=8, mean=4.62

============ Serving Benchmark Result ============
Backend:                                 vllm-chat 
Traffic request rate:                    0.5       
Max request concurrency:                 not set   
Successful requests:                     32        
Benchmark duration (s):                  95.82     
Total input tokens:                      244991    
Total input text tokens:                 67687     
Total input vision tokens:               177304    
Total generated tokens:                  18        
Total generated tokens (retokenized):    18        
Request throughput (req/s):              0.33      
Input token throughput (tok/s):          2556.73   
Output token throughput (tok/s):         0.19      
Total token throughput (tok/s):          2556.91   
Concurrency:                             0.20      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   609.72    
Median E2E Latency (ms):                 611.81    
---------------Time to First Token----------------
Mean TTFT (ms):                          378.07    
Median TTFT (ms):                        465.47    
P99 TTFT (ms):                           993.11    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================
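
The "Concurrency" rows in the four reports can be sanity-checked against the other printed values with Little's law (mean in-flight requests = completed requests x mean E2E latency / benchmark duration). This is a sketch that assumes the reported concurrency is a time-averaged in-flight count; how the benchmark tool actually derives it is not shown in this thread.

```python
# (label, successful requests, duration in s, mean E2E latency in ms, reported concurrency)
# All values are copied from the four reports in this comment.
runs = [
    ("Qwen3-VL-30B-A3B colocate", 32, 71.10, 2865.13, 1.29),
    ("Qwen3-VL-30B-A3B EPD", 32, 71.12, 1468.98, 0.66),
    ("Qwen2.5-7B-VL colocate", 32, 95.81, 1191.23, 0.40),
    ("Qwen2.5-7B-VL EPD", 32, 95.82, 609.72, 0.20),
]


def littles_law_concurrency(requests: int, duration_s: float, mean_e2e_ms: float) -> float:
    """Average number of in-flight requests implied by Little's law."""
    throughput = requests / duration_s          # requests per second
    return throughput * (mean_e2e_ms / 1000.0)  # L = lambda * W


estimates = {
    label: littles_law_concurrency(n, duration, e2e)
    for label, n, duration, e2e, _ in runs
}
for label, _, _, _, reported in runs:
    print(f"{label:28s} estimated={estimates[label]:.3f} reported={reported:.2f}")
```

The estimates agree with the reported values to within rounding, which suggests the roughly halved concurrency in the EPD runs simply reflects the roughly halved mean E2E latency at an unchanged request rate.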

@wisclmy0611 merged commit 31eb7e7 into lm-sys:main on Jan 12, 2026

5 participants