[DCP] Support dcp kv_cache interleave size > 1 #26696
LucasWilkinson merged 13 commits into vllm-project:main
Conversation
Code Review
This pull request introduces support for configurable interleave size for KV cache in Decode Context Parallelism (DCP), which is a nice enhancement for flexibility. The changes also include refactoring the dcp_local_seq_lens computation into a utility function. The implementation is mostly solid, but I've identified a couple of areas for improvement. One is a misleading error message in an assertion, and the other is an opportunity to refactor a new utility function for better readability and efficiency. Addressing these points will improve the code quality.
@@ -52,6 +53,7 @@ def detailed(
    tp_base: int = 4,
    pp_base: int = 1,
    dcp_base: int = 1,
    cp_kv_cache_interleave_size: int = 1,
please call this dcp_kv_cache_interleave_size
After prefill cp (#25852) is supported, this kv_cache_interleave_size will be used for both dcp and pcp, shall we keep this name for future usage?
sure; by that logic we should update dcp_local_seq_lens to cp_local_seq_lens too but we can do that in the pcp PR
vllm/v1/worker/block_table.py
    self,
    req_indices: np.ndarray,
    positions: np.ndarray,
    cp_kv_cache_interleave_size: int = 1,
since this is a constant pass it via the init
Thanks for the review; we have already passed it via init.
vllm/utils/__init__.py
@@ -3426,3 +3426,35 @@ def unique_filepath(fn: Callable[[int], Path]) -> Path:
        if not p.exists():
            return p
        i += 1


def get_dcp_local_seq_lens(
we should find a better spot for this; this is too broad of a utils file for a feature specific utility
We have now put this function in vllm/v1/attention/backends/utils.py, the same place where CommonAttentionMetadata.dcp_local_seq_lens is defined; this should be a more appropriate spot.
vllm/v1/worker/gpu_model_runner.py
@@ -1276,6 +1287,14 @@ def _prepare_inputs(
        logits_indices
    )

    # update seq_lens of decode reqs under DCP.
    if self.dcp_world_size > 1:
        self.dcp_local_seq_lens.gpu[:num_reqs] = get_dcp_local_seq_lens(
I think it might actually be better to compute get_dcp_local_seq_lens using host buffers and then do a non-blocking copy to self.dcp_local_seq_lens.gpu (see: CpuGpuBuffer.copy_to_gpu)
(then when async scheduling is enabled it will be overlapped)
Modified as suggested, thanks for review
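The pattern suggested above (compute on host buffers, then issue a non-blocking host-to-device copy so the transfer can overlap under async scheduling) can be sketched roughly as follows. `CpuGpuBuffer` here is a simplified stand-in written for illustration, not vLLM's actual class:

```python
import torch


class CpuGpuBuffer:
    """Simplified stand-in for a paired host/device buffer (illustrative only)."""

    def __init__(self, n: int, device: str):
        # non_blocking copies only overlap when the host side is pinned memory
        pin = device == "cuda" and torch.cuda.is_available()
        self.cpu = torch.zeros(n, dtype=torch.int32, pin_memory=pin)
        self.np = self.cpu.numpy()  # host view for cheap numpy writes
        self.gpu = torch.zeros(n, dtype=torch.int32, device=device)

    def copy_to_gpu(self, n: int) -> torch.Tensor:
        # async H2D copy; returns immediately when src is pinned and dst is CUDA
        self.gpu[:n].copy_(self.cpu[:n], non_blocking=True)
        return self.gpu[:n]


device = "cuda" if torch.cuda.is_available() else "cpu"
buf = CpuGpuBuffer(8, device)
buf.np[:4] = [5, 3, 7, 2]   # compute the per-rank seq lens on the host
out = buf.copy_to_gpu(4)    # overlaps with other work under async scheduling
```

With async scheduling enabled, the host-side computation for the next step can proceed while this copy is in flight, which is the benefit the reviewer points out.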
vllm/v1/worker/gpu_model_runner.py
@@ -256,6 +258,11 @@ def __init__(
        self.is_multimodal_pruning_enabled = False
        self.max_model_len = model_config.max_model_len
        self.dcp_world_size = self.parallel_config.decode_context_parallel_size
        try:
            self.dcp_rank = get_dcp_group().rank_in_group
Would it be better to delay this until get_dcp_local_seq_lens is called?
In some cases we might need to know how seq_len is split globally, not just the local seq_len on the current dcp_rank. For example, in our current NPU MLA implementation we need the global seq_len split to calculate a mask for the following update_lse (if no kv_cache is stored on some (d)cp ranks, there is no need to do the corresponding update_lse). So we think it is better to return the full seq_len split result from get_dcp_local_seq_lens, and each dcp_rank can select its corresponding part as needed.
nit: I think we can simplify this to:
self.dcp_rank = 0 if self.dcp_world_size <= 1 else get_dcp_group().rank_in_group
that way we'll still get the benefit of the assert in get_dcp_group(), and if a test sets self.dcp_world_size > 1 it should be initializing the DCP group anyway
better way to get dcp_rank 👍 Modified as suggested
vllm/v1/worker/block_table.py
    self,
    req_indices: np.ndarray,
    positions: np.ndarray,
    cp_kv_cache_interleave_size: int = 1,
nit: maybe no default val is better
Thanks for the review; we now pass this arg via init, since it's a constant.
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
LucasWilkinson
left a comment
LGTM! Thanks for all the hard work, and sorry about the back and forth.
We should refactor reorg_kvcache at some point (maybe use a Triton kernel), but that can be done in the future; I appreciate the clarifying comments/renaming!
Signed-off-by: zhangsicheng5 <zhangsicheng5@huawei.com>
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
Co-authored-by: QiuChunshuo <qiuchunshuo@huawei.com>
Hello! I have a question: if I am not using PD disaggregation, specifically when deploying an MLA-based model with DCP8, would setting cp_kv_cache_interleave_size to 64 yield any performance gains compared to the default of 1?
Theoretically, when PD disaggregation is not used, the value of
Purpose
1. cp_kv_cache_interleave_size support
In the DCP scenario, the kv_cache is split across DCP ranks. The current implementation (#23734) splits the kv_cache in a token-level interleaved style: token i is stored on the GPU whose dcp_rank == i % dcp_world_size.
To make PD disaggregation easier to support, we add the cp_kv_cache_interleave_size argument to control the interleave size of the kv_cache split: store interleave_size tokens on dcp rank i, then the next interleave_size tokens on dcp rank i+1, and so on. The default value of cp_kv_cache_interleave_size is 1, which matches the original token-level interleaved implementation. By setting cp_kv_cache_interleave_size to block_size, we can split the kv_cache in a block-level interleaved style, which makes it easy to support PD disaggregation with dcp > 1: D nodes only need to pull the corresponding kv_cache blocks, without needing to rearrange tokens within blocks.
Only DCP with cp_kv_cache_interleave_size is supported for now, but the (P)CP case has also been considered and is easy to extend in the future.
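The interleaved layout described above can be sketched with a small helper that maps a global token index to the DCP rank holding it (the function name is illustrative, not from the PR):

```python
def token_to_dcp_rank(token_idx: int, dcp_world_size: int,
                      interleave_size: int = 1) -> int:
    """Rank that stores `token_idx` under interleaved kv_cache splitting.

    Groups of `interleave_size` consecutive tokens are assigned to successive
    ranks, round-robin. With interleave_size == 1 this reduces to the original
    token-level interleave: rank == token_idx % dcp_world_size.
    """
    return (token_idx // interleave_size) % dcp_world_size


# With dcp_world_size=2 and interleave_size=4 (e.g. the block size), tokens
# 0-3 land on rank 0, tokens 4-7 on rank 1, tokens 8-11 back on rank 0, etc.
ranks = [token_to_dcp_rank(i, 2, 4) for i in range(12)]
```

Setting interleave_size equal to the kv_cache block size is what makes whole blocks land on a single rank, so a D node can pull blocks directly.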
2. Move dcp_local_seq_lens computation to utils
Move the dcp_local_seq_lens computation into a utility function and pass the result via metadata, so other attention backends can reuse it.
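A rough sketch of what such a utility might compute, consistent with the thread above: it returns the full (num_reqs, dcp_world_size) split so every rank can select its own part. This is an assumption-based illustration, not vLLM's actual implementation:

```python
import numpy as np


def get_dcp_local_seq_lens(
    seq_lens: np.ndarray,       # (num_reqs,) global sequence lengths
    dcp_world_size: int,
    interleave_size: int = 1,
) -> np.ndarray:
    """Tokens held by each DCP rank, shape (num_reqs, dcp_world_size).

    Tokens are dealt out in chunks of `interleave_size` to ranks
    0, 1, ..., dcp_world_size-1, round-robin.
    """
    seq_lens = seq_lens[:, None]                    # (num_reqs, 1)
    ranks = np.arange(dcp_world_size)[None, :]      # (1, dcp_world_size)
    round_size = dcp_world_size * interleave_size
    full_rounds = seq_lens // round_size            # complete interleave rounds
    remainder = seq_lens % round_size               # tokens in the last, partial round
    # rank r gets up to interleave_size of the leftover tokens, in rank order
    extra = np.clip(remainder - ranks * interleave_size, 0, interleave_size)
    return full_rounds * interleave_size + extra
```

For example, a 10-token sequence with dcp_world_size=2 and interleave_size=4 gives 6 tokens on rank 0 and 4 on rank 1; the current rank would then take its column of the result.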
Test Plan
Model: DeepSeek-V2-Lite-Chat
Dataset: gsm8k
Test Result
tp2 dcp2, original code
tp2 dcp2, interleave_size = 1
tp2 dcp2, interleave_size = 64