[P/D][Feature] Support kv_transfer_params for parallel sampling (n>1) #38900

Open
chaunceyjiang wants to merge 13 commits into vllm-project:main from chaunceyjiang:pd_with_request_n

Conversation

@chaunceyjiang
Collaborator

@chaunceyjiang chaunceyjiang commented Apr 3, 2026

Before this PR, P/D (Prefill/Decode) did not support parallel sampling (n > 1).

The reason is that when request.n > 1 (e.g., request.n = 3), the Prefill node will split the request into three child requests. These child requests are then processed independently by the EngineCore, resulting in different request_ids and kv_block_ids. For example:

  • request_id = [0_chatcmpl-87d0ef24-418b, 1_chatcmpl-87d0ef24-418b, 2_chatcmpl-87d0ef24-418b]
  • kv_block_id = [[1,2,3,4,5,9], [1,2,3,4,5,10], [1,2,3,4,5,11]]

However, the original design of kv_transfer_params only supports transferring the request_id and kv_block_id for a single request. As a result, the KV blocks for the other child requests are not transferred.

In addition, the Decode stage relies on request_id to notify the Prefill stage to release resources. Because only one request_id is carried, Decode can notify Prefill about just one child request, and the remaining child requests cannot be properly processed or released.
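
The gap described above can be illustrated with a small sketch. It assumes, purely for illustration, that the Prefill side tracks held blocks in a dict keyed by child request ID; `held_blocks` and `release` are hypothetical names, not vLLM APIs, and the IDs are the shortened examples from the list above.

```python
# Hypothetical model of the Prefill side: blocks held per child request ID.
held_blocks = {
    "0_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 9],
    "1_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 10],
    "2_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 11],
}

# Original design: kv_transfer_params describes exactly one request.
single_params = {
    "remote_request_id": "0_chatcmpl-87d0ef24-418b",
    "remote_block_ids": [[1, 2, 3, 4, 5, 9]],
}


def release(held: dict[str, list[int]], notified_ids: list[str]) -> dict[str, list[int]]:
    """Return the entries that remain held after Decode notifies `notified_ids`."""
    return {rid: blocks for rid, blocks in held.items() if rid not in notified_ids}


# Decode only knows the one ID it was given, so the other two children stay held.
still_held = release(held_blocks, [single_params["remote_request_id"]])
print(sorted(still_held))
```

In this toy model, the entries for children 1 and 2 are never released because their IDs never reach Decode.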

To address this, this PR introduces a backward-compatible change to kv_transfer_params. When request.n > 1, it allows notifying and handling all child requests instead of just a single one.
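
Since kv_transfer_params can now be either a single object (n = 1) or a list (n > 1), a consumer such as a P/D proxy may want to normalize both shapes before use. The following is a minimal sketch under that assumption; `as_param_list` is a hypothetical helper and is not part of this PR.

```python
from __future__ import annotations

from typing import Any

KVParams = dict[str, Any]


def as_param_list(kv_transfer_params: KVParams | list[KVParams] | None) -> list[KVParams]:
    """Normalize both response shapes to a list with one entry per child request."""
    if kv_transfer_params is None:
        return []
    if isinstance(kv_transfer_params, dict):
        # n == 1: the original single-object shape is preserved.
        return [kv_transfer_params]
    # n > 1: already one entry per child request.
    return list(kv_transfer_params)


single = {"remote_request_id": "chatcmpl-c757b78f"}
multi = [{"remote_request_id": f"{i}_chatcmpl-87d0ef24"} for i in range(3)]

for params in as_param_list(multi):
    print(params["remote_request_id"])
```

Handling the dict case first keeps existing n = 1 clients working unchanged, which is the backward compatibility the description refers to.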

Purpose

Support kv_transfer_params for parallel sampling (n>1)

Test Plan

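from openai import OpenAI

# Assumed endpoint and key; point these at the server under test.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "9.11 and 9.8, which is lower?"}]
response = client.chat.completions.create(
    model="my-model",
    messages=messages,
    stream=False,  # non-streaming, so response.choices can be iterated below
    n=3,  # parallel sampling: three choices for one prompt
)
for choice in response.choices:
    print(f"Choice {choice.index}:")
    print(choice.message.content)
    print()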

Test Result

Choice 0:
....
**Answer:**  
9.11 is lower.

Choice 1:
....
**Answer**:  
**9.11 is lower than 9.8.**

Choice 2:
...
4. **Conclusion**:  
   **9.11** is lower than **9.8**.

**Answer**:  
**9.11 is lower than 9.8.**


ver pid=967379) INFO 04-03 18:42:22 [metrics.py:103] KV Transfer metrics: Num successful transfers=3, Avg xfer time (ms)=3.637, P90 xfer time (ms)=4.696, Avg post time (ms)=0.928, P90 post time (ms)=1.235, Avg MB per transfer=2.25, Throughput (MB/s)=618.642, Avg number of descriptors=72.0


request.n = 3

{
  "id": "chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae",
  "object": "chat.completion",
  "created": 1775208649,
  "model": "my-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    },
    {
      "index": 1,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    },
    {
      "index": 2,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 576,
    "total_tokens": 579,
    "completion_tokens": 3,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": [            # <<<<- all requests and all kv_block_id
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
.....
          25,
          26,
          27,
          28,
          29,
          30,
          31,
          32,
          33,
          34,
          35,
          40
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "0_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    },
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
          7,
  .....
          31,
          32,
          33,
          34,
          35,
          41
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "1_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    },
    {
      "do_remote_prefill": true,
      "do_remote_decode": false,
      "remote_block_ids": [
        [
          1,
          2,
          3,
          4,
          5,
          6,
.....
          30,
          31,
          32,
          33,
          34,
          35,
          42
        ]
      ],
      "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
      "remote_request_id": "2_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae",
      "remote_host": "localhost",
      "remote_port": 6700,
      "tp_size": 1
    }
  ]
}

request.n = 1

{
  "id": "chatcmpl-c757b78f-4258-4022-8ef9-0d9e9712b87c",
  "object": "chat.completion",
  "created": 1775208779,
  "model": "my-model",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning": ""
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 576,
    "total_tokens": 577,
    "completion_tokens": 1,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "prompt_token_ids": null,
  "kv_transfer_params": {           # <<<<- one request and one kv_block_id
    "do_remote_prefill": true,
    "do_remote_decode": false,
    "remote_block_ids": [
      [
        1,
        2,
        3,
        4,
        5,
        6,
        7,
   ....
        31,
        32,
        33,
        34,
        35,
        43
      ]
    ],
    "remote_engine_id": "777e6cbd-af31-4305-a3c0-b335bdcc9867",
    "remote_request_id": "chatcmpl-c757b78f-4258-4022-8ef9-0d9e9712b87c-85f8a2e3",
    "remote_host": "localhost",
    "remote_port": 6700,
    "tp_size": 1
  }
}
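
In the two responses above, each child's remote_request_id looks like the parent-style ID with an `<index>_` prefix, while the n = 1 ID has no prefix. Assuming that observed pattern holds (it is inferred from this output only, not from a documented format), a client could recover the child index as sketched below; `split_child_id` is a hypothetical helper.

```python
from __future__ import annotations


def split_child_id(remote_request_id: str) -> tuple[int | None, str]:
    """Split '<index>_<rest>' into (index, rest); return (None, id) if there is no index prefix."""
    head, sep, rest = remote_request_id.partition("_")
    if sep and head.isdigit():
        return int(head), rest
    return None, remote_request_id


# IDs copied from the n = 3 and n = 1 responses above.
child = "1_chatcmpl-87d0ef24-418b-4224-b436-6f31226848ae-b296b0ae"
single = "chatcmpl-c757b78f-4258-4022-8ef9-0d9e9712b87c-85f8a2e3"

print(split_child_id(child))
print(split_child_id(single))
```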



Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for list-based kv_transfer_params to facilitate disaggregated serving during parallel sampling. The changes update the OpenAI protocol schemas, modify RequestOutput to aggregate parameters from multiple completions, and update the V1 engine to handle per-child parameter distribution. However, the review identifies several critical issues: the use of deepcopy and the disabling of caching for parallel sampling introduce significant performance regressions. Additionally, the parameter aggregation logic in both ParentRequest and RequestOutput contains bugs that could lead to incorrect ordering, duplicate entries, and broken streaming support. There is also a potential AttributeError in the sampling parameter logic when extra_args is not provided.
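
The AttributeError concern can be made concrete with a small sketch. It assumes the per-request parameters travel under a `kv_transfer_params` key in an optional `extra_args` mapping; `SamplingParamsSketch` and `child_kv_params` are hypothetical stand-ins and do not reproduce the PR's code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SamplingParamsSketch:
    """Hypothetical stand-in; only the field relevant here is modeled."""

    extra_args: dict[str, Any] | None = None


def child_kv_params(params: SamplingParamsSketch, index: int) -> dict[str, Any] | None:
    """Pick the kv_transfer_params entry for child `index`, tolerating missing extra_args."""
    extra = params.extra_args or {}  # avoids AttributeError when extra_args is None
    kv = extra.get("kv_transfer_params")
    if isinstance(kv, list):
        # List shape: one entry per child; out-of-range indexes yield None.
        return kv[index] if 0 <= index < len(kv) else None
    return kv  # a single dict (same for every child) or None


per_child = [{"remote_request_id": f"{i}_req"} for i in range(3)]
with_list = SamplingParamsSketch(extra_args={"kv_transfer_params": per_child})
print(child_kv_params(with_list, 2))
print(child_kv_params(SamplingParamsSketch(), 0))
```

Indexing into the list rather than deep-copying the whole structure for each child is one way to sidestep the copying cost the review mentions, though whether that fits the actual code path is not confirmed here.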

Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Comment thread vllm/outputs.py Outdated
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang marked this pull request as ready for review April 3, 2026 10:32
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@ZhanqiuHu
Contributor

ZhanqiuHu commented Apr 6, 2026

Thanks for the PR! Just one thought: instead of aggregating kv_transfer_params into a list on RequestOutput, would it be better to put it on each CompletionOutput? Then each CompletionOutput would carry its own kv_transfer_params.

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

would it be better to put it on each CompletionOutput? Then each CompletionOutput would carry its own kv_transfer_params.

@ZhanqiuHu Done.

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@vllm-project vllm-project deleted a comment from mergify Bot Apr 7, 2026
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang
Collaborator Author

@ZhanqiuHu Ready for review.

Contributor

@ZhanqiuHu ZhanqiuHu left a comment

Thanks for the update! Added a few minor suggestions.

Comment thread vllm/outputs.py Outdated
Comment thread vllm/entrypoints/openai/chat_completion/protocol.py
Comment thread vllm/v1/engine/parallel_sampling.py Outdated
Contributor

@ZhanqiuHu ZhanqiuHu left a comment

@NickLucche I left a few comments; appreciate your thoughts when you get a chance.

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang chaunceyjiang requested a review from ZhanqiuHu April 8, 2026 03:37
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@vllm-project vllm-project deleted a comment from mergify Bot Apr 8, 2026
@chaunceyjiang
Collaborator Author

@NickLucche @ZhanqiuHu Ready for review.

@chaunceyjiang chaunceyjiang requested review from NickLucche and removed request for ZhanqiuHu April 8, 2026 04:29
Contributor

@ZhanqiuHu ZhanqiuHu left a comment

LGTM, thanks for the quick updates!

@chaunceyjiang chaunceyjiang added the ready label Apr 9, 2026
Member

@NickLucche NickLucche left a comment

Thanks for the patience, @chaunceyjiang, this makes sense to me.
Would appreciate an ack from @njhill on the changes in outputs.py to make sure we're aligned.

  prompt_logprobs: list[dict[int, Logprob] | None] | None = None
  prompt_token_ids: list[int] | None = None
- kv_transfer_params: dict[str, Any] | None = Field(
+ kv_transfer_params: dict[str, Any] | list[dict[str, Any]] | None = Field(
Member

nit: we could also comment here when we expect a list

@njhill
Member

njhill commented Apr 9, 2026

Thanks for the PR! Just one thought: instead of aggregating kv_transfer_params into a list on RequestOutput, would it be better to put it on each CompletionOutput? Then each CompletionOutput would carry its own kv_transfer_params.

Yeah, I'm not sure we should do this, since it will break backwards compatibility of the Python API. At least we should still keep it in RequestOutput for the n=1 case, but that then makes things a bit messy.

Instead of the approach in this PR, could we keep a single transfer and just have the secondary sub-requests share the common prefix of blocks? I guess that would only be an issue if the block size is large and/or we want to avoid doing even intra-block prefill on the decode side...

@chaunceyjiang
Collaborator Author

chaunceyjiang commented Apr 10, 2026

Instead of the approach in this PR, could we keep a single transfer and just have the secondary sub-requests share the common prefix of blocks? I guess that would only be an issue if the block size is large and/or we want to avoid doing even intra-block prefill on the decode side...

@njhill
Sorry, I'm a bit confused.

As I mentioned in the PR, the issue here isn't just about KV blocks; it's also about the communication between prefill and decode nodes. Right now, notifications are tied to a single request_id, and kv_transfer_params only carries one request_id.

Also, if I recall correctly, the nixl connector already avoids duplicate transfers when two requests share the same blocks (see _read_blocks); it only transfers the last block that differs.

So the current PR is already doing a single transfer. It just also sends the corresponding request_id alongside it.
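
The deduplication idea can be sketched as follows. This is not the NIXL connector's `_read_blocks` logic, only an illustration of the principle under the assumption that block IDs shared between children refer to the same data; `plan_transfers` is a hypothetical helper.

```python
def plan_transfers(children: dict[str, list[int]]) -> dict[str, list[int]]:
    """For each child request, list only the blocks no earlier child already covers."""
    seen: set[int] = set()
    plan: dict[str, list[int]] = {}
    for request_id, block_ids in children.items():
        # Keep the entry even when few blocks remain: the request_id itself
        # is still needed so Prefill can be notified about this child.
        plan[request_id] = [b for b in block_ids if b not in seen]
        seen.update(block_ids)
    return plan


# Shortened IDs and block lists from the PR description.
children = {
    "0_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 9],
    "1_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 10],
    "2_chatcmpl-87d0ef24-418b": [1, 2, 3, 4, 5, 11],
}
for request_id, blocks in plan_transfers(children).items():
    print(request_id, blocks)
```

In this model the shared prefix moves once and each later child contributes only its differing last block, while every child keeps its own request_id entry.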

@chaunceyjiang
Collaborator Author

At least we should still keep it in RequestOutput for the n=1 case

@njhill It's currently compatible. You can take another look at the test results I pasted in the PR description.

Comment thread vllm/outputs.py
self.kv_transfer_params = kv_transfer_params

@property
def kv_transfer_params(self) -> dict[str, Any] | list[dict[str, Any]] | None:
Collaborator Author

it will break backwards compatibility of the Python API

@njhill I’ve also taken this case into consideration. I’ve added some backward-compatible handling here, and you can see that the constructor supports **kwargs: Any.

So whether it’s initialization or accessing the kv_transfer_params attribute, it should remain compatible.
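
A rough sketch of what such backward-compatible handling could look like is given below. It is an assumption-laden illustration, not the code in vllm/outputs.py: `CompletionSketch` and `RequestOutputSketch` are hypothetical classes that model only the per-completion parameters, the aggregating property, and a legacy keyword argument.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CompletionSketch:
    """Hypothetical per-choice output carrying its own kv_transfer_params."""

    index: int
    kv_transfer_params: dict[str, Any] | None = None


class RequestOutputSketch:
    def __init__(self, outputs: list[CompletionSketch], **kwargs: Any) -> None:
        self.outputs = outputs
        # Legacy callers may still pass kv_transfer_params=...; keep it as a fallback.
        self._legacy_kv_transfer_params = kwargs.pop("kv_transfer_params", None)

    @property
    def kv_transfer_params(self) -> dict[str, Any] | list[dict[str, Any]] | None:
        ordered = sorted(self.outputs, key=lambda o: o.index)
        per_child = [o.kv_transfer_params for o in ordered if o.kv_transfer_params is not None]
        if not per_child:
            return self._legacy_kv_transfer_params
        if len(self.outputs) == 1:
            return per_child[0]  # n == 1: same single-dict shape as before
        return per_child  # n > 1: one entry per child, ordered by index


one = RequestOutputSketch([CompletionSketch(0, {"remote_request_id": "req"})])
print(one.kv_transfer_params)
```

Keeping the n = 1 return value a plain dict is what lets existing attribute access continue to work in this sketch.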

@chaunceyjiang
Collaborator Author

Instead of the approach in this PR, could we keep a single transfer and just have the secondary sub-requests share the common prefix of blocks?

Hi @njhill

For the case where request.n > 1, I tried a few different approaches, but it looks like we still need to pass the child-request request_id from prefill. Otherwise, Nixl’s prefill won’t clean up the child-request caches, and they’ll just stick around until they expire. Could you share a bit more detail on what you had in mind with this approach?

@mergify
Contributor

mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@NUABO

NUABO commented Jun 3, 2026

@chaunceyjiang May I ask if there has been any progress on this?

@chaunceyjiang
Collaborator Author

@chaunceyjiang May I ask if there has been any progress on this?

@NUABO

Waiting for @njhill's review.

Labels

frontend, kv-connector, needs-rebase, ready, v1

5 participants