Disaggregate prefill for kv cache register style #950
ganyi1996ppo merged 96 commits into vllm-project:main
linfeng-yuan
left a comment
I think we need to add compatibility here for $1 and $2 in different nodes.
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: underfituu <hzhucong@163.com>
wangxiyuan
left a comment
Let's merge this first to unblock others
ACL_FORMAT_FRACTAL_ND),
)
dtype = kv_cache_spec.dtype
if self.model_config.is_deepseek_mla:
Why was the condition changed from "self.torchair_graph_enabled" to "self.model_config.is_deepseek_mla"?
It is effectively the same. At the time, only DeepSeek supported torchair. Later Pangu was added, using torchair together with MHA, but this code path is written specifically for MLA. So we changed the condition to one that is more accurate.
kv_cache,
alignment)[:cache_size].view(cache_shape)
kv_cache_list.append(kv_cache)
kv_caches[layer_name] = tuple(kv_cache_list)
Why was the type changed from tensor to tuple, and how does this affect D2H and H2D?
It has nothing to do with D2H or H2D. The type was changed because of the alignment limitation on Ascend hardware; refer to https://github.com/vllm-project/vllm-ascend/pull/950/files/9c20fdda8111de05756ac4ed0a3c80cd776cfb34#diff-c49594855b615477bbc34f06d2d423a7dd84c021a7925cd1f61fdb79cb814c08R2064
Could you provide more details on the alignment limitations of Ascend hardware? We are currently implementing a KV Cache Connector, and these modifications will affect the offload/load operations of the cache in the connector.
We'd like to understand this modification in more detail to better adapt our implementation.
The memory needs to be 4M aligned, that's all
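For readers adapting a KV cache connector, the over-allocate, align, then slice pattern visible in the diff above can be sketched roughly as follows. This is only an illustration on host memory using ctypes, not the code from this PR: the names ALIGNMENT, aligned_offset, allocate_aligned, and build_layer_cache are hypothetical, "4M" is assumed to mean 4 MiB, and the two-part (key, value) layout is an assumption made for the example.

```python
import ctypes

# Assumption: "4M aligned" in the reply above means 4 MiB.
ALIGNMENT = 4 * 1024 * 1024


def aligned_offset(address: int, alignment: int = ALIGNMENT) -> int:
    """Bytes to skip from `address` to reach the next multiple of `alignment`."""
    return (-address) % alignment


def allocate_aligned(cache_size: int, alignment: int = ALIGNMENT):
    """Return (raw_buffer, view, aligned_address) for `cache_size` usable bytes."""
    # Over-allocate by `alignment` bytes so an aligned start always fits.
    raw = (ctypes.c_uint8 * (cache_size + alignment))()
    offset = aligned_offset(ctypes.addressof(raw), alignment)
    # Zero-copy view over exactly `cache_size` bytes from the aligned address,
    # analogous to the `[:cache_size]` slice in the diff.
    view = memoryview(raw)[offset:offset + cache_size]
    return raw, view, ctypes.addressof(raw) + offset


def build_layer_cache(part_sizes):
    """Allocate each part separately and return them as a tuple per layer."""
    parts = [allocate_aligned(size) for size in part_sizes]
    return tuple(parts)


# Hypothetical layer with two separately aligned parts (e.g. key and value).
layer = build_layer_cache([1024, 2048])
for raw, view, address in layer:
    print(len(view), address % ALIGNMENT)
```

Because each part gets its own aligned start address, the parts are no longer contiguous in one buffer, which is one plausible reason a per-layer tuple replaces a single tensor. A connector that offloads or loads the cache would then iterate over the tuple instead of assuming one contiguous tensor.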
What this PR does / why we need it?
This PR adopts LLMDataDist for KV cache registration and a pull_blocks style disaggregated prefill implementation. The interface implementation mainly follows the design of the NIXL PR https://github.com/vllm-project/vllm/pull/17751/files#diff-7eaad0b7dee0626bf29d10081b0f0c5e3ea15a4af97e7b182a4e0d35f8346953 .
This PR can be tested with the following step:
Use toy_proxy.py to launch the disaggregated prefill proxy server, specifying the prefill IP and port and the decode IP and port.
Does this PR introduce any user-facing change?
How was this patch tested?