
[Feature] Decode Context Parallel support for GPU model runner v2#34179

Merged
WoosukKwon merged 3 commits into main from wentao-dcp-support-for-v2 on Feb 18, 2026

Conversation

@yewentao256 (Member) commented Feb 9, 2026

Purpose

Part of #32455.

Enable Decode Context Parallel (DCP) for the V2 model runner, with CUDA graph support.

Performance is slightly slower than V1, but neither V2 nor DCP has been seriously optimized yet; we can improve performance in follow-up PRs.
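For readers unfamiliar with DCP: it shards the KV cache of each request across the decode-context-parallel ranks so that long-context decode fits in memory. A common scheme, assumed here purely for illustration (the real layout is defined by vLLM's model runner, not this snippet), is round-robin interleaving of token positions across ranks:

```python
def dcp_shard_tokens(num_tokens: int, dcp_size: int) -> list:
    """Assign each token position to a DCP rank by round-robin interleaving.

    Illustrative sketch only: rank r holds every position p where
    p % dcp_size == r, so KV-cache memory per rank shrinks by ~dcp_size.
    """
    shards = [[] for _ in range(dcp_size)]
    for pos in range(num_tokens):
        shards[pos % dcp_size].append(pos)
    return shards

# With dcp_size=4 (matching the -dcp 4 launch flag below),
# a 10-token context is split as:
shards = dcp_shard_tokens(10, 4)
# rank 0 -> [0, 4, 8], rank 1 -> [1, 5, 9], rank 2 -> [2, 6], rank 3 -> [3, 7]
```

At decode time each rank attends over its local shard and the partial results are combined across ranks.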

Test

export MODEL="deepseek-ai/DeepSeek-V2-lite"
export VLLM_USE_V2_MODEL_RUNNER=1
vllm serve $MODEL -tp 4  --port 9256 --enable-expert-parallel --max_num_seqs 128 -dcp 4

ACC

lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=$MODEL,num_concurrent=1024" --tasks gsm8k
# export VLLM_USE_V2_MODEL_RUNNER=1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.3844|±  |0.0134|
|     |       |strict-match    |     5|exact_match|   |0.3806|±  |0.0134|

# export VLLM_USE_V2_MODEL_RUNNER=0
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|   |0.3768|±  |0.0133|
|     |       |strict-match    |     5|exact_match|   |0.3730|±  |0.0133|

Perf

vllm bench serve --model $MODEL  --dataset-name random --host 127.0.0.1 --port 9256 --random-input-len 2 --random-output-len 512 --request-rate inf --num-prompts 128 --num-warmups 16
# V2
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  5.84      
Total input tokens:                      128       
Total generated tokens:                  65536     
Request throughput (req/s):              21.90     
Output token throughput (tok/s):         11214.20  
Peak output token throughput (tok/s):    11648.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          11236.10  
---------------Time to First Token----------------
Mean TTFT (ms):                          187.06    
Median TTFT (ms):                        188.50    
P99 TTFT (ms):                           200.01    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.04     
Median TPOT (ms):                        11.04     
P99 TPOT (ms):                           11.14     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.04     
Median ITL (ms):                         11.00     
P99 ITL (ms):                            12.88     
==================================================

# V1
============ Serving Benchmark Result ============
Successful requests:                     128       
Failed requests:                         0         
Benchmark duration (s):                  5.50      
Total input tokens:                      128       
Total generated tokens:                  65536     
Request throughput (req/s):              23.28     
Output token throughput (tok/s):         11917.34  
Peak output token throughput (tok/s):    12313.00  
Peak concurrent requests:                128.00    
Total token throughput (tok/s):          11940.62  
---------------Time to First Token----------------
Mean TTFT (ms):                          138.40    
Median TTFT (ms):                        149.05    
P99 TTFT (ms):                           167.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          10.44     
Median TPOT (ms):                        10.43     
P99 TPOT (ms):                           10.50     
---------------Inter-token Latency----------------
Mean ITL (ms):                           10.44     
Median ITL (ms):                         10.43     
P99 ITL (ms):                            13.39     
==================================================

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 requested review from njhill and removed request for WoosukKwon February 9, 2026 22:49
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces Decode Context Parallel (DCP) support for the v2 GPU model runner. The changes involve updating attention utilities, block table management, and the model runner itself to handle sharded KV caches for DCP. The core logic is implemented in a Triton kernel for slot mapping. My review focuses on code structure and maintainability. I've identified an opportunity to reduce code duplication for better long-term maintenance.
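The slot-mapping logic the review refers to can be sketched in plain Python. The actual implementation is a Triton kernel operating on batched tensors; the function below is a hypothetical scalar rendition, assuming the interleaved token-to-rank layout described above and a standard paged-KV block table:

```python
def dcp_slot_mapping(token_positions, block_table, block_size, dcp_rank, dcp_size):
    """Map global token positions to physical KV-cache slots on one DCP rank.

    Illustrative sketch: a rank only stores positions where
    pos % dcp_size == dcp_rank; the local index pos // dcp_size is then
    translated through the per-request block table, exactly as in ordinary
    paged attention. Positions not owned by this rank get a -1 sentinel.
    """
    slots = []
    for pos in token_positions:
        if pos % dcp_size != dcp_rank:
            slots.append(-1)  # token lives on another DCP rank
            continue
        local = pos // dcp_size                 # index into this rank's shard
        block = block_table[local // block_size]  # physical block id
        slots.append(block * block_size + local % block_size)
    return slots

# block_size=4, dcp_size=2, rank 0, blocks [7, 3]:
# position 0 -> slot 28, position 1 -> other rank, position 8 -> slot 12
print(dcp_slot_mapping([0, 1, 2, 8], [7, 3], 4, 0, 2))  # [28, -1, 29, 12]
```

Writing this per-token arithmetic as a Triton kernel lets the runner compute the whole slot-mapping tensor on the GPU, which keeps it compatible with CUDA graph capture.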

Signed-off-by: yewentao256 <zhyanwentao@126.com>
@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 9, 2026
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@WoosukKwon (Collaborator) left a comment


LGTM. Thanks for the PR!
I will follow up with simple refactoring.

@github-project-automation github-project-automation bot moved this to Ready in NVIDIA Feb 18, 2026
@WoosukKwon WoosukKwon merged commit ab33d2a into main Feb 18, 2026
12 of 16 checks passed
@WoosukKwon WoosukKwon deleted the wentao-dcp-support-for-v2 branch February 18, 2026 00:27
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Feb 18, 2026
wzhao18 pushed a commit to wzhao18/vllm that referenced this pull request Feb 18, 2026
jasonozuzu-cohere pushed a commit to jasonozuzu-cohere/vllm that referenced this pull request Feb 18, 2026
@yewentao256 yewentao256 restored the wentao-dcp-support-for-v2 branch February 18, 2026 19:29
ZJY0516 pushed a commit to ZJY0516/vllm that referenced this pull request Feb 23, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
liuchenbing2026 pushed a commit to liuchenbing2026/vllm that referenced this pull request Apr 4, 2026

Labels

claude-code-assisted nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done


2 participants