[P/D][Feature] Support kv_transfer_params for parallel sampling (n>1)#38900
[P/D][Feature] Support kv_transfer_params for parallel sampling (n>1)#38900chaunceyjiang wants to merge 13 commits into
Conversation
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces support for list-based kv_transfer_params to facilitate disaggregated serving during parallel sampling. The changes update the OpenAI protocol schemas, modify RequestOutput to aggregate parameters from multiple completions, and update the V1 engine to handle per-child parameter distribution. However, the review identifies several critical issues: the use of deepcopy and the disabling of caching for parallel sampling introduce significant performance regressions. Additionally, the parameter aggregation logic in both ParentRequest and RequestOutput contains bugs that could lead to incorrect ordering, duplicate entries, and broken streaming support. There is also a potential AttributeError in the sampling parameter logic when extra_args is not provided.
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
c10fca8 to
f4a096f
Compare
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
Thanks for the PR! Just one thought: Instead of aggregating |
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@ZhanqiuHu Done. |
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
@ZhanqiuHu Ready for review. |
ZhanqiuHu
left a comment
There was a problem hiding this comment.
Thanks for the update! Added few minor suggestions.
ZhanqiuHu
left a comment
There was a problem hiding this comment.
@NickLucche I left a few comments; appreciate your thoughts when you get a chance.
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
|
@NickLucche @ZhanqiuHu Ready for review. |
ZhanqiuHu
left a comment
There was a problem hiding this comment.
LGTM, thanks for the quick updates!
NickLucche
left a comment
There was a problem hiding this comment.
Thanks for the patience @chaunceyjiang , this makes sense to me.
Would appreciate an ack from @njhill due to the changes in outputs.py to make sure we're aligned
| prompt_logprobs: list[dict[int, Logprob] | None] | None = None | ||
| prompt_token_ids: list[int] | None = None | ||
| kv_transfer_params: dict[str, Any] | None = Field( | ||
| kv_transfer_params: dict[str, Any] | list[dict[str, Any]] | None = Field( |
There was a problem hiding this comment.
nit: we could also comment here when we expect a list
Yeah I'm not sure we should do this since it will break backwards compatibility of the python API. At least we should still keep in RequestOutput for n=1 case, but that then makes things a bit messy. Instead of the approach in this PR, could we instead keep a single transfer just have secondary sub-requests share the common prefix of blocks? I guess that would only be an issue if the block size is large and/or we want to avoid doing even intra-block prefill on the decode side... |
@njhill As I mentioned in the PR, the issue here isn't just about KV blocks . it's also about the communication between prefill and decode nodes. Right now, notifications are tied to a single request_id, and kv_transfer_params only carries one request_id. Also, if I recall correctly, the nixl connector already avoids duplicate transfers when two requests share the same blocks,see _read_blocks . it only transfers the last block that differs. So the current PR is already doing a single transfer. It just also sends the corresponding request_id alongside it. |
@njhill It's currently compatible. You can take another look at the test results I pasted in the PR description. |
| self.kv_transfer_params = kv_transfer_params | ||
|
|
||
| @property | ||
| def kv_transfer_params(self) -> dict[str, Any] | list[dict[str, Any]] | None: |
There was a problem hiding this comment.
it will break backwards compatibility of the python API
@njhill I’ve also taken this case into consideration. I’ve added some backward-compatible handling here, and you can see that the constructor supports **kwargs: Any.
So whether it’s initialization or accessing the kv_transfer_params attribute, it should remain compatible.
HI @njhill For the case where request.n > 1, I tried a few different approaches, but it looks like we still need to pass the child-request request_id from prefill. Otherwise, Nixl’s prefill won’t clean up the child-request caches, and they’ll just stick around until they expire. Could you share a bit more detail on what you had in mind with this approach? |
|
This pull request has merge conflicts that must be resolved before it can be |
|
@chaunceyjiang May I ask if there has been any progress on this? |
Waiting for @njhill's review. |
Before this PR, P/D (Prefill/Decode) did not support parallel sampling (
n > 1).The reason is that when
request.n > 1(e.g.,request.n = 3), the Prefill node will split the request into three child requests. These child requests are then processed independently by the EngineCore, resulting in differentrequest_ids andkv_block_ids. For example:request_id = [0_chatcmpl-87d0ef24-418b, 1_chatcmpl-87d0ef24-418b, 2_chatcmpl-87d0ef24-418b]kv_block_id = [[1,2,3,4,5,9], [1,2,3,4,5,10], [1,2,3,4,5,11]]However, the original design of
kv_transfer_paramsonly supports transferring therequest_idandkv_block_idfor a single request. As a result, the KV blocks for the other child requests are not transferred.In addition, the Decode stage relies on
request_idto notify the Prefill stage to release resources. This limitation effectively restricts Prefill to handling only one request, while the remaining child requests cannot be properly processed.To address this, this PR introduces a backward-compatible change to
kv_transfer_params. Whenrequest.n > 1, it allows notifying and handling all child requests instead of just a single one.Purpose
Support kv_transfer_params for parallel sampling (n>1)
Test Plan
Test Result
request.n = 3
request.n = 1
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.