[NIXL] Support P tensor-parallel-size > D tensor-parallel-size #27274

DarkLight1337 merged 5 commits into vllm-project:main
Conversation
cc @GuanLuo let me know if this PR meets the expected set of features you aimed to get with your work. Thank you!

```diff
@@ -16,11 +16,13 @@ def __init__(
     finished_sending: set[str] | None = None,
```

ignore file, to be rebased once #26734 lands
```diff
@@ -413,7 +413,8 @@ def get_required_kvcache_layout(cls, vllm_config: "VllmConfig") -> str | None:
     def get_finished_count(self) -> int | None:
         """
         Get the count of requests expected to complete send/receive operations
-        via this connector.
+        via this connector. This method is used to initialize the
```

ignore, to be rebased once #26734 lands
```diff
             tp_ratio,
         )

         ### (Optional) Register local agent memory regions. MLA is not split.
```
```diff
@@ -1593,16 +1712,14 @@ def _read_blocks(
         # Number of D TP workers that will read from dst P. Propagate tp_ratio
         # on notification so that dst worker can wait before freeing blocks.
-        tp_ratio = self.kv_topo.tp_ratio_from_engine_id(dst_engine_id)
+        # Cap to 1 when P TP > D TP: only a single rank will read from remote.
+        tp_ratio = max(1, self.kv_topo.tp_ratio_from_engine_id(dst_engine_id))
```

this is to have P only wait for 1 request instead of -tp_ratio
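The cap can be illustrated with a tiny sketch. Assumption (per this PR's convention): `tp_ratio` is negative when the remote P TP size is greater than the local D TP size; `expected_reader_count` is a hypothetical helper name, not the actual vLLM code.

```python
# Sketch of the notification-count cap discussed above. Assumption: per this
# PR's convention, tp_ratio is negative when remote P TP > local D TP.
# expected_reader_count is an illustrative name, not the actual vLLM helper.

def expected_reader_count(tp_ratio: int) -> int:
    # tp_ratio > 0: that many D workers will read from each P rank.
    # tp_ratio < 0 (P TP > D TP): only a single D rank reads from this P rank,
    # so the P side should wait for exactly one notification.
    return max(1, tp_ratio)

print(expected_reader_count(4))   # D TP = 4x P TP: four readers
print(expected_reader_count(-2))  # P TP = 2x D TP: capped to a single reader
```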
```diff
@@ -4,10 +4,9 @@
 KV cache helper for store.
```

ignore file, to be rebased once #26734 lands
vllm/executor/executor_base.py (outdated)

```diff
@@ -6,7 +6,7 @@
 from abc import ABC, abstractmethod
```

ignore file, to be rebased once #26734 lands
```diff
@@ -160,9 +160,7 @@ def __init__(
     )
```

ignore, to be rebased once #26734 lands
```diff
@@ -86,8 +86,14 @@ class KVConnectorOutput:
     finished_recving: set[str] | None = None
```

ignore, to be rebased once #26734 lands
This pull request has merge conflicts that must be resolved before it can be merged.
PR's now ready for review!

cc @xuechendi for xpu

@zhenwei-intel , please help to review, thx
```diff
         # Number of NIXL regions. Currently one region per cache
         # (so 1 per layer for MLA, otherwise 2 per layer)
         self.num_regions = 0
         self.num_layers = 0

-        # nixl_prepped_dlist_handle.
-        self.src_xfer_side_handle: int = 0
```

dropped default `self.src_xfer_side_handle` in favor of `self.src_xfer_handles_by_block_size[self.block_size]`
```diff
         if self.use_mla and tp_ratio < 0:
             # ..but we still need to notify the other remote ranks that we
             # have the blocks we need so they can update the request state.
```

important mla logic
PR is verified with the heter_block_size test, and it looks good.
njhill left a comment

I took a first pass, looks very cleanly done, really awesome work @NickLucche!

Just a few style suggestions. I have not gone through all of the logic in detail yet and will try to spend a bit more time on that, but it looks pretty solid!
xinyu-intel left a comment

verified with vllm-gaudi
```diff
@@ -1857,7 +1937,7 @@ def _pop_done_transfers(self, transfers: dict[str, list[int]]) -> set[str]:
         """
         done_req_ids: set[str] = set()
         for req_id, handles in list(transfers.items()):
-            in_progress = False
+            in_progress = []
```

prev logic was broken for multiple transfers.
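A minimal sketch of the corrected bookkeeping: tracking the still-pending handles as a list instead of a single boolean means a request is only marked done once every one of its transfers has completed. `check_state` and `pop_done_transfers` are illustrative stand-ins, not the actual NIXL wrapper API.

```python
# Sketch of the fixed _pop_done_transfers idea: keep the list of handles still
# in progress per request, so a request with multiple transfers is not freed
# while some of them are still in flight. check_state is a stand-in for
# querying transfer status from the NIXL wrapper.

def pop_done_transfers(transfers: dict[str, list[int]], check_state) -> set[str]:
    """Remove and return ids of requests whose transfers have ALL completed."""
    done_req_ids: set[str] = set()
    for req_id, handles in list(transfers.items()):
        in_progress = [h for h in handles if check_state(h) != "DONE"]
        if in_progress:
            # At least one transfer pending: keep only the unfinished handles.
            transfers[req_id] = in_progress
        else:
            done_req_ids.add(req_id)
            del transfers[req_id]
    return done_req_ids

# Request "a" still has one pending transfer, "b" is fully done.
states = {1: "DONE", 2: "PROC", 3: "DONE"}
xfers = {"a": [1, 2], "b": [3]}
done = pop_done_transfers(xfers, lambda h: states[h])  # -> {"b"}
```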
```diff
-        self.src_xfer_side_handle = 0
         for dst_xfer_side_handle in self.dst_xfer_side_handles.values():
             self.nixl_wrapper.release_dlist_handle(dst_xfer_side_handle)
+        for handle in self.src_xfer_handles_by_block_size.values():
```

just matching handles structures changes
Overview
This PR addresses the case where P tensor-parallel-size > D tensor-parallel-size.

I think it helps to differentiate two main cases:
MLA
For MLA models, the workflow is easier: each D worker reads from a single P worker (reads are fanned out so that not all D workers read from the same remote), as the MLA cache is duplicated. Some P workers will not be read from at all.

Mind that this also holds for DP/EP deployments, where the TP size on D will often be 1!
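The fan-out above can be sketched as a simple rank mapping; the modulo scheme and the helper name are illustrative assumptions, not necessarily the exact policy implemented by the connector.

```python
# Sketch of MLA fan-out: since the MLA KV cache is duplicated across P ranks,
# each D worker can read from any single P worker, so readers are spread
# across P ranks rather than all hitting the same remote. The modulo mapping
# below is an illustrative assumption, not the exact scheme in the connector.

def target_p_rank(d_rank: int, p_tp_size: int) -> int:
    return d_rank % p_tp_size

# D TP = 2, P TP = 4: D ranks 0 and 1 read from P ranks 0 and 1;
# P ranks 2 and 3 are never read from, which is fine for a duplicated cache.
targets = [target_p_rank(r, p_tp_size=4) for r in range(2)]  # -> [0, 1]
```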
From PR #23917, which also serves as a good use-case. Btw, as explained in that PR, the number of requests to "expect" is indeed the number of remote instances reading from P.

The main issue with implementing that in NIXL is that each P worker tracks requests as they come in (`_reqs_to_send`, `_reqs_to_process`), and those structures are only cleared properly when a read is detected (otherwise timeouts would be raised on P). To address that, I am allowing MLA D ranks to execute only one transfer, while notifying all affected remotes that the read is completed (sending multiple NIXL notifs).
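A sketch of the notification side, under the assumption that each D rank "covers" a contiguous group of `tp_ratio` P ranks; `notify_ranks` is a hypothetical helper, and the real code sends NIXL notifications rather than returning rank ids.

```python
# Sketch: an MLA D rank performs a single read but must notify every P rank
# tracking the request (each P worker records it in _reqs_to_send /
# _reqs_to_process and would otherwise hit a timeout). Assumes each D rank
# covers a contiguous group of tp_ratio P ranks; notify_ranks is illustrative.

def notify_ranks(d_rank: int, d_tp_size: int, p_tp_size: int) -> list[int]:
    assert p_tp_size % d_tp_size == 0, "P TP must be a multiple of D TP"
    tp_ratio = p_tp_size // d_tp_size
    # Only one rank in this group is actually read from, but all of them get
    # the "read completed" notification so they can free the request's blocks.
    return [d_rank * tp_ratio + i for i in range(tp_ratio)]

# D TP = 2, P TP = 4: D rank 0 notifies P ranks 0 and 1.
print(notify_ranks(0, d_tp_size=2, p_tp_size=4))
```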
cc @njhill @markmc
Dense
For dense models, every D worker will read from n P workers to re-compose its own KV cache, where n is referred to as `tp_ratio` in the code. This is possible because the number of heads on P is H/n that of D's, so you can efficiently read into D's cache using the HND layout. That is, in memory, you're just laying out flat ND tensors of H/n heads, n times.
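The head arithmetic above can be sketched as follows; `dense_read_plan` and the contiguous rank grouping are illustrative assumptions, not the connector's actual API.

```python
# Sketch of the dense case: a D worker re-composes its KV cache by reading
# from tp_ratio P workers, each holding num_kv_heads / p_tp_size KV heads.
# The contiguous grouping and the helper name are illustrative assumptions.

def dense_read_plan(d_rank: int, d_tp_size: int, p_tp_size: int,
                    num_kv_heads: int) -> list[tuple[int, int]]:
    """Return (p_rank, heads_read) pairs this D rank reads from."""
    assert p_tp_size % d_tp_size == 0, "P TP must be a multiple of D TP"
    tp_ratio = p_tp_size // d_tp_size
    heads_per_p = num_kv_heads // p_tp_size
    return [(d_rank * tp_ratio + i, heads_per_p) for i in range(tp_ratio)]

# 8 KV heads, P TP = 4 (2 heads per P rank), D TP = 2 (4 heads per D rank):
# D rank 1 reads its 4 heads from P ranks 2 and 3, 2 heads apiece.
plan = dense_read_plan(1, d_tp_size=2, p_tp_size=4, num_kv_heads=8)
```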
Side note: the current design is flexible and allows for dynamic discovery of remotes with different TP sizes. This is not a feature that is currently supported, but it helps to keep in mind when considering impl choices. It's more of an optional route I'd like to keep open.
Changes
The main change this PR needs to make is allowing a D worker to read from multiple P's. Practical edits this PR introduces to do so:
- `src_xfer_side_chunked_handles`: local regions need to be split differently based on how many remotes we want to read from. This is prepared during handshake, once.
- `[engine_id][rank_no]` indexing to accommodate the above
- `get_target_remote` -> `get_target_remotes` for the same reason, + a bunch of for loops over its result
- `tp_ratio` extension to indicate a remote P size greater than D's
- `_pop_done_transfers` fix

How to test
And check out `tp_config_sweep_accuracy` with config:
TODO
Coming soon to this PR:
- [ ] On MLA with DP/EP, avoid having all workers read from the same remote (deferring)

It does NOT support the replicated KV heads scenario, `tp_size > num_heads`. This is definitely doable, just I believe it's on weak demand atm so we can postpone it.