
[Feature] Decode Context Parallel support for GPU model runner v2#34179

Merged
WoosukKwon merged 3 commits into main from wentao-dcp-support-for-v2 on Feb 18, 2026

Conversation

@yewentao256 (Member) commented Feb 9, 2026

Purpose

Part of #32455.

Enable Decode Context Parallel (DCP) for the V2 model runner, with CUDA graph support.

Performance is slightly slower than V1, but neither V2 nor DCP has been seriously optimized yet; we can improve performance in follow-up PRs.
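For readers unfamiliar with DCP: it shards the KV cache of each request across the decode-context-parallel ranks so that long-context decode fits in memory. A common scheme, assumed here purely for illustration (the real layout is defined by vLLM's model runner, not this snippet), is round-robin interleaving of token positions across ranks:

```python
def dcp_shard_tokens(num_tokens: int, dcp_size: int) -> list:
    """Assign each token position to a DCP rank by round-robin interleaving.

    Illustrative sketch only: rank r holds every position p where
    p % dcp_size == r, so KV-cache memory per rank shrinks by ~dcp_size.
    """
    shards = [[] for _ in range(dcp_size)]
    for pos in range(num_tokens):
        shards[pos % dcp_size].append(pos)
    return shards

# With dcp_size=4 (matching the -dcp 4 launch flag below),
# a 10-token context is split as:
shards = dcp_shard_tokens(10, 4)
# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]
```

At decode time each rank attends over its local shard and the partial results are combined across ranks.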

Test

export MODEL="deepseek-ai/DeepSeek-V2-lite"
export VLLM_USE_V2_MODEL_RUNNER=1
vllm serve $MODEL -tp 4  --port 9256 --enable-expert-parallel --max_num_seqs 128 -dcp 4

ACC

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k
# export VLLM_USE_V2_MODEL_RUNNER=1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.3844|±  |0.0134|
|     |       |strict-match    |     5|exact_match|   |0.3806|±  |0.0134|

# export VLLM_USE_V2_MODEL_RUNNER=0
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.3768|±  |0.0133|
|     |       |strict-match    |     5|exact_match|   |0.3730|±  |0.0133|

Perf

vllm bench serve --model $MODEL  --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 512 --request-rate inf --num-prompts 128 --num-warmups 16
# V2
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  5.84      
Total input tokens:                      128       
Total generated tokens:                  65536     
Request throughput (req/s):              21.90     
Output token throughput (tok/s):         11214.20  
Peak output token throughput (tok/s):    11648.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          11236.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          187.06    
Median TTFT (ms):                        188.50    
P99 TTFT (ms):                           200.01    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.04     
Median TPOT (ms):                        11.04     
P99 TPOT (ms):                           11.14     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.04     
Median ITL (ms):                         11.00     
P99 ITL (ms):                            12.88     
==================================================

# V1
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  5.50      
Total input tokens:                      128       
Total generated tokens:                  65536     
Request throughput (req/s):              23.28     
Output token throughput (tok/s):         11917.34  
Peak output token throughput (tok/s):    12313.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          11940.62  
---------------Time to First Token----------------
Mean TTFT (ms):                          138.40    
Median TTFT (ms):                        149.05    
P99 TTFT (ms):                           167.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.44     
Median TPOT (ms):                        10.43     
P99 TPOT (ms):                           10.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.44     
Median ITL (ms):                         10.43     
P99 ITL (ms):                            13.39     
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 requested review from njhill and removed request for WoosukKwon February 9, 2026 22:49
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces Decode Context Parallel (DCP) support for the v2 GPU model runner. The changes involve updating attention utilities, block table management, and the model runner itself to handle sharded KV caches for DCP. The core logic is implemented in a Triton kernel for slot mapping. My review focuses on code structure and maintainability. I've identified an opportunity to reduce code duplication for better long-term maintenance.
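The slot-mapping logic the review refers to can be sketched in plain Python. The actual implementation is a Triton kernel operating on batched tensors; the function below is a hypothetical scalar rendition, assuming the interleaved token-to-rank layout described above and a standard paged-KV block table:

```python
def dcp_slot_mapping(token_positions, block_table, block_size, dcp_rank, dcp_size):
    """Map global token positions to physical KV-cache slots on one DCP rank.

    Illustrative sketch: a rank only stores positions where
    pos % dcp_size == dcp_rank; the local index pos // dcp_size is then
    translated through the per-request block table, exactly as in ordinary
    paged attention. Positions not owned by this rank get a -1 sentinel.
    """
    slots = []
    for pos in token_positions:
        if pos % dcp_size != dcp_rank:
            slots.append(-1)  # token lives on another DCP rank
            continue
        local = pos // dcp_size                 # index into this rank's shard
        block = block_table[local // block_size]  # physical block id
        slots.append(block * block_size + local % block_size)
    return slots

# block_size=4, dcp_size=2, rank 0, blocks [7, 3]:
# position 0 -> slot 28, position 1 -> other rank, position 8 -> slot 12
print(dcp_slot_mapping([0, 1, 2, 8], [7, 3], 4, 0, 2))  # [28, -1, 29, 12]
```

Writing this per-token arithmetic as a Triton kernel lets the runner compute the whole slot-mapping tensor on the GPU, which keeps it compatible with CUDA graph capture.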

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 9, 2026
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@WoosukKwon (Collaborator) left a comment


LGTM. Thanks for the PR!
I will follow up with simple refactoring.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 18, 2026
@WoosukKwon WoosukKwon merged commit ab33d2a into main Feb 18, 2026
12 of 16 checks passed
@WoosukKwon WoosukKwon deleted the wentao-dcp-support-for-v2 branch February 18, 2026 00:27
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 18, 2026
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
@yewentao256 yewentao256 restored the wentao-dcp-support-for-v2 branch February 18, 2026 19:29
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026

Labels

claude-code-assisted nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done


2 participants