[P/D][Feature] Support kv_transfer_params for parallel sampling (n>1) by chaunceyjiang · Pull Request #38900 · vllm-project/vllm

chaunceyjiang · 2026-04-03T09:24:30Z

Before this PR, P/D (Prefill/Decode) did not support parallel sampling (n > 1).

The reason is that when request.n > 1 (e.g., request.n = 3), the Prefill node will split the request into three child requests. These child requests are then processed independently by the EngineCore, resulting in different request_ids and kv_block_ids. For example:

request_id = [0_chatcmpl-87d0ef24-418b, 1_chatcmpl-87d0ef24-418b, 2_chatcmpl-87d0ef24-418b]
kv_block_id = [[1,2,3,4,5,9], [1,2,3,4,5,10], [1,2,3,4,5,11]]
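
The split described above can be sketched as follows. This is a minimal illustration, not vLLM's implementation: `child_request_ids` is a hypothetical helper, and the `<index>_<parent id>` naming is inferred only from the example IDs shown here.

```python
def child_request_ids(parent_id: str, n: int) -> list[str]:
    """Hypothetical helper: derive one child id per sample as '<index>_<parent_id>'."""
    return [f"{index}_{parent_id}" for index in range(n)]


# Values copied from the example above.
parent_id = "chatcmpl-87d0ef24-418b"
kv_block_ids = [[1, 2, 3, 4, 5, 9], [1, 2, 3, 4, 5, 10], [1, 2, 3, 4, 5, 11]]

children = child_request_ids(parent_id, n=3)
blocks_by_child = dict(zip(children, kv_block_ids))

for request_id, block_ids in blocks_by_child.items():
    print(request_id, block_ids)
```

Each child ends up with its own id and its own final block, so a single request_id / kv_block_id pair cannot describe all three.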

However, the original design of kv_transfer_params only supports transferring the request_id and kv_block_id for a single request. As a result, the KV blocks for the other child requests are not transferred.

In addition, the Decode stage relies on request_id to notify the Prefill stage to release resources. This limitation effectively restricts Prefill to handling only one request, while the remaining child requests cannot be properly processed.

To address this, this PR introduces a backward-compatible change to kv_transfer_params. When request.n > 1, it allows notifying and handling all child requests instead of just a single one.
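
Because kv_transfer_params may now be either a single dict (n = 1) or a list of dicts (n > 1), a consumer such as a P/D proxy would need to accept both shapes. The sketch below shows one way to do that; `normalize_kv_transfer_params` is a hypothetical helper and is not part of this PR.

```python
from __future__ import annotations

from typing import Any

KVParams = dict[str, Any]


def normalize_kv_transfer_params(
    params: KVParams | list[KVParams] | None,
) -> list[KVParams]:
    """Hypothetical helper: always return a list, one entry per (child) request."""
    if params is None:
        return []
    if isinstance(params, dict):
        # Legacy shape for n = 1: a single dict.
        return [params]
    # Shape for n > 1: one dict per child request.
    return list(params)


single = {"remote_request_id": "chatcmpl-abc"}
many = [{"remote_request_id": "0_chatcmpl-abc"}, {"remote_request_id": "1_chatcmpl-abc"}]

for entry in normalize_kv_transfer_params(many):
    print(entry["remote_request_id"])
```

Normalizing at the boundary keeps the rest of the consumer free of isinstance checks, while callers that only ever send n = 1 continue to see the dict shape they already handle.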

Purpose

Support kv_transfer_params for parallel sampling (n>1)

Test Plan

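from openai import OpenAI

# Assumed endpoint; adjust base_url and api_key for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = False  # response.choices below requires a non-streaming response

messages = [{"role": "user", "content": "9.11 and 9.8, which is lower?"}]
response = client.chat.completions.create(
    model="my-model",
    messages=messages,
    stream=stream,
    n=3,  # parallel sampling: three choices for one prompt
)
for choice in response.choices:
    print(f"Choice {choice.index}:")
    print(choice.message.content)
    print()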

Test Result

Choice 0:
....
**Answer:**  
9.11 is lower.

Choice 1:
....
**Answer**:  
**9.11 is lower than 9.8.**

Choice 2:
...
4. **Conclusion**:  
   **9.11** is lower than **9.8**.

**Answer**:  
**9.11 is lower than 9.8.**

ver pid=967379) INFO 04-03 18:42:22 [metrics.py:103] KV Transfer metrics: Num successful transfers=3, Avg xfer time (ms)=3.637, P90 xfer time (ms)=4.696, Avg post time (ms)=0.928, P90 post time (ms)=1.235, Avg MB per transfer=2.25, Throughput (MB/s)=618.642, Avg number of descriptors=72.0

request.n = 3

{
  "id": "chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae",
  "object": "chat.completion",
  "created": 1775208649,
  "model": "my-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    },
    {
      "index": 2,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 576,
    "total_tokens": 579,
    "completion_tokens": 3,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": [            # <<<<- all requests and all kv_block_id
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
.....
          25,
          26,
          27,
          28,
          29,
          30,
          31,
          32,
          33,
          34,
          35,
          40
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "0_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    },
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
  .....
          31,
          32,
          33,
          34,
          35,
          41
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "1_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    },
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
.....
          30,
          31,
          32,
          33,
          34,
          35,
          42
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "2_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    }
  ]
}

request.n = 1

{
  "id": "chatcmpl-c757b78f-4258-4022-8ef9-0d9e9712b87c",
  "object": "chat.completion",
  "created": 1775208779,
  "model": "my-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 576,
    "total_tokens": 577,
    "completion_tokens": 1,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": {           # <<<<- one request and one kv_block_id
    "do_remote_prefill": true,
    "do_remote_decode": false,
    "remote_block_ids": [
      [
        1,
        2,
        3,
        4,
        5,
        6,
        7,
   ....
        31,
        32,
        33,
        34,
        35,
        43
      ]
    ],
    "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
    "remote_request_id": "chatcmpl-c757b78f-4258-4022-8ef9-0d9e9712b87c-85f8a2e3",
    "remote_host": "localhost",
    "remote_port": 6700,
    "tp_size": 1
  }
}

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

gemini-code-assist

Code Review

This pull request introduces support for list-based kv_transfer_params to facilitate disaggregated serving during parallel sampling. The changes update the OpenAI protocol schemas, modify RequestOutput to aggregate parameters from multiple completions, and update the V1 engine to handle per-child parameter distribution. However, the review identifies several critical issues: the use of deepcopy and the disabling of caching for parallel sampling introduce significant performance regressions. Additionally, the parameter aggregation logic in both ParentRequest and RequestOutput contains bugs that could lead to incorrect ordering, duplicate entries, and broken streaming support. There is also a potential AttributeError in the sampling parameter logic when extra_args is not provided.

ZhanqiuHu · 2026-04-06T14:38:19Z

Thanks for the PR! Just one thought: instead of aggregating kv_transfer_params into a list on RequestOutput, would it be better to put it on each CompletionOutput instead? Then each CompletionOutput would carry its own kv_transfer_params.

chaunceyjiang · 2026-04-07T02:50:40Z

> would it be better to put it on each CompletionOutput instead? Then each CompletionOutput would carry its own kv_transfer_params.

@ZhanqiuHu Done.

chaunceyjiang · 2026-04-07T13:08:22Z

@ZhanqiuHu Ready for review.

ZhanqiuHu

Thanks for the update! Added a few minor suggestions.

ZhanqiuHu

@NickLucche I left a few comments; appreciate your thoughts when you get a chance.

chaunceyjiang · 2026-04-08T04:29:27Z

@NickLucche @ZhanqiuHu Ready for review.

ZhanqiuHu

LGTM, thanks for the quick updates!

NickLucche

Thanks for the patience @chaunceyjiang, this makes sense to me.
Would appreciate an ack from @njhill due to the changes in outputs.py, to make sure we're aligned.

NickLucche · 2026-04-09T17:06:11Z

    prompt_logprobs: list[dict[int, Logprob] | None] | None = None
    prompt_token_ids: list[int] | None = None
-    kv_transfer_params: dict[str, Any] | None = Field(
+    kv_transfer_params: dict[str, Any] | list[dict[str, Any]] | None = Field(


nit: we could also comment here when we expect a list

njhill · 2026-04-09T18:35:22Z

> Thanks for the PR! Just one thought: instead of aggregating kv_transfer_params into a list on RequestOutput, would it be better to put it on each CompletionOutput instead? Then each CompletionOutput would carry its own kv_transfer_params.

Yeah I'm not sure we should do this since it will break backwards compatibility of the python API. At least we should still keep in RequestOutput for n=1 case, but that then makes things a bit messy.

Instead of the approach in this PR, could we instead keep a single transfer and just have secondary sub-requests share the common prefix of blocks? I guess that would only be an issue if the block size is large and/or we want to avoid doing even intra-block prefill on the decode side...

chaunceyjiang · 2026-04-10T00:05:52Z

> Instead of the approach in this PR, could we instead keep a single transfer and just have secondary sub-requests share the common prefix of blocks? I guess that would only be an issue if the block size is large and/or we want to avoid doing even intra-block prefill on the decode side...

@njhill
Sorry, I'm a bit confused.

As I mentioned in the PR, the issue here isn't just about KV blocks. It's also about the communication between prefill and decode nodes. Right now, notifications are tied to a single request_id, and kv_transfer_params only carries one request_id.

Also, if I recall correctly, the nixl connector already avoids duplicate transfers when two requests share the same blocks; see _read_blocks. It only transfers the last block that differs.

So the current PR is already doing a single transfer. It just also sends the corresponding request_id alongside it.
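
To illustrate the idea of not re-transferring shared blocks, here is a minimal sketch. It is not the NIXL connector's `_read_blocks`; `plan_block_reads` is a hypothetical function, and it assumes that equal block ids across child requests refer to the same remote block.

```python
def plan_block_reads(blocks_by_request: dict[str, list[int]]) -> dict[str, list[int]]:
    """Return, per request, only the block ids not already planned for an earlier request."""
    already_planned: set[int] = set()
    plan: dict[str, list[int]] = {}
    for request_id, block_ids in blocks_by_request.items():
        new_blocks = [b for b in block_ids if b not in already_planned]
        already_planned.update(new_blocks)
        plan[request_id] = new_blocks
    return plan


# Block ids from the example in the PR description (request ids shortened).
plan = plan_block_reads(
    {
        "0_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 9],
        "1_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 10],
        "2_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 11],
    }
)
for request_id, block_ids in plan.items():
    print(request_id, block_ids)
```

Under that assumption the shared prefix is read once, and each later child only adds its own last block, while every child request_id is still present in the plan.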

chaunceyjiang · 2026-04-10T00:11:10Z

> At least we should still keep in RequestOutput for n=1 case

@njhill It's currently compatible. You can take another look at the test results I pasted in the PR description.

chaunceyjiang · 2026-04-10T03:35:00Z

-        self.kv_transfer_params = kv_transfer_params
+
+    @property
+    def kv_transfer_params(self) -> dict[str, Any] | list[dict[str, Any]] | None:


> it will break backwards compatibility of the python API

@njhill I’ve also taken this case into consideration. I’ve added some backward-compatible handling here, and you can see that the constructor supports **kwargs: Any.

So whether it’s initialization or accessing the kv_transfer_params attribute, it should remain compatible.
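
A rough sketch of what this kind of backward-compatible handling could look like is shown below. The class names `CompletionSketch` and `RequestOutputSketch` are hypothetical stand-ins, and the behavior is an assumption for illustration, not the code in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CompletionSketch:
    """Hypothetical stand-in for a per-sample output."""
    index: int
    kv_transfer_params: dict[str, Any] | None = None


class RequestOutputSketch:
    """Hypothetical stand-in that keeps the old attribute and constructor keyword."""

    def __init__(self, outputs: list[CompletionSketch], **kwargs: Any) -> None:
        self.outputs = outputs
        # Legacy callers may still pass kv_transfer_params=...; attach it to the
        # only output when there is exactly one and it has no params yet.
        legacy = kwargs.pop("kv_transfer_params", None)
        if legacy is not None and len(outputs) == 1 and outputs[0].kv_transfer_params is None:
            outputs[0].kv_transfer_params = legacy

    @property
    def kv_transfer_params(self) -> dict[str, Any] | list[dict[str, Any]] | None:
        # n = 1: same dict-or-None shape as before.
        if len(self.outputs) == 1:
            return self.outputs[0].kv_transfer_params
        # n > 1: one dict per output that has params, or None if there are none.
        params = [o.kv_transfer_params for o in self.outputs if o.kv_transfer_params is not None]
        return params or None


legacy_style = RequestOutputSketch([CompletionSketch(0)], kv_transfer_params={"remote_request_id": "a"})
parallel = RequestOutputSketch(
    [CompletionSketch(0, {"remote_request_id": "0_a"}), CompletionSketch(1, {"remote_request_id": "1_a"})]
)
print(legacy_style.kv_transfer_params)
print(parallel.kv_transfer_params)
```

With this shape, code that constructs the object with the old keyword or reads the old attribute keeps working for n = 1, and only n > 1 callers see a list.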

chaunceyjiang · 2026-04-15T02:47:54Z

> Instead of the approach in this PR, could we instead keep a single transfer and just have secondary sub-requests share the common prefix of blocks?

Hi @njhill

For the case where request.n > 1, I tried a few different approaches, but it looks like we still need to pass the child-request request_id from prefill. Otherwise, Nixl’s prefill won’t clean up the child-request caches, and they’ll just stick around until they expire. Could you share a bit more detail on what you had in mind with this approach?
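
The cleanup problem can be pictured with a small model. This is only a sketch under assumptions: `PrefillBlockTracker` and its methods are hypothetical and do not correspond to the NIXL connector's actual bookkeeping.

```python
class PrefillBlockTracker:
    """Hypothetical model: prefill holds blocks per request id until decode notifies it."""

    def __init__(self) -> None:
        self.pending: dict[str, list[int]] = {}

    def hold(self, request_id: str, block_ids: list[int]) -> None:
        # Keep the blocks alive until the decode side confirms it has read them.
        self.pending[request_id] = block_ids

    def on_decode_notification(self, request_id: str) -> list[int]:
        # Release and return the blocks held for this request id, if any.
        return self.pending.pop(request_id, [])


tracker = PrefillBlockTracker()
for index, last_block in enumerate([9, 10, 11]):
    tracker.hold(f"{index}_chatcmpl-87d0ef24-418b", [1, 2, 3, 4, 5, last_block])

# Only the first child id reaches decode, so only it gets released.
tracker.on_decode_notification("0_chatcmpl-87d0ef24-418b")
still_held = sorted(tracker.pending)
print(still_held)
```

In this model the entries for children 1 and 2 stay held because no notification ever names them, which matches the behavior described above where they linger until they expire.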

mergify · 2026-05-23T07:56:49Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

NUABO · 2026-06-03T08:22:15Z

@chaunceyjiang May I ask if there has been any progress on this?

chaunceyjiang · 2026-06-03T08:28:08Z

> @chaunceyjiang May I ask if there has been any progress on this?

@NUABO

Waiting for @njhill's review.

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

d4c20ef

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

mergify Bot added frontend v1 kv-connector labels Apr 3, 2026

gemini-code-assist Bot reviewed Apr 3, 2026

chaunceyjiang added 4 commits April 3, 2026 17:48

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

0836848

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

b7482d8

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

cc9a763

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

f4a096f

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang force-pushed the pd_with_request_n branch from c10fca8 to f4a096f Compare April 3, 2026 10:25

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

2eddc95

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang marked this pull request as ready for review April 3, 2026 10:32

chaunceyjiang requested review from DarkLight1337, aarnphm, njhill and russellb as code owners April 3, 2026 10:32

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

6e53cde

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

eae6076

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang added 2 commits April 7, 2026 10:52

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

16d9bde

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

003d3ee

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

vllm-project deleted a comment from mergify Bot Apr 7, 2026

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

2327dfe

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

ZhanqiuHu reviewed Apr 8, 2026

Comment thread vllm/outputs.py Outdated

Comment thread vllm/entrypoints/openai/chat_completion/protocol.py

Comment thread vllm/v1/engine/parallel_sampling.py Outdated

ZhanqiuHu reviewed Apr 8, 2026

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

dd58080

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang requested a review from ZhanqiuHu April 8, 2026 03:37

[Feature] Support kv_transfer_params for parallel sampling (n>1) in PD

a221751

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

vllm-project deleted a comment from mergify Bot Apr 8, 2026

chaunceyjiang requested review from NickLucche and removed request for ZhanqiuHu April 8, 2026 04:29

ZhanqiuHu reviewed Apr 8, 2026

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 9, 2026

NickLucche reviewed Apr 9, 2026

chaunceyjiang commented Apr 10, 2026

mergify Bot added the needs-rebase label May 23, 2026

crazyguitar mentioned this pull request May 24, 2026

[Bugfix][NIXL] Fix best_of_n KV leak and early-notification race in P/D #43509

Open

4 tasks
