[Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist #694
ganyi1996ppo merged 16 commits into vllm-project:main
Conversation
wuhuikx
left a comment
Can we have a README.md in examples/disaggreated_prefill to guide users through running it? It could include:
- disaggreated_prefill_offline.sh
- disaggreated_prefill_online.sh
```python
options["ge.exec.deviceId"] = str(self.rank)
print(f"prepare datadist, options: {options}")
self.data_dist.init(options)
self.kv_transfer = self.data_dist.kv_cache_manager
```
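For context, the snippet above initializes llm_datadist and takes a handle to its kv_cache_manager. A rough pure-Python mock of the cache lifecycle discussed in this thread (allocate_cache, then deallocate_cache on the producer, with the actual release deferred until the consumer's pull_cache) might look like the following; this is an illustration of the lifecycle only, not the real llm_datadist API:

```python
class MockKVTransfer:
    """Mock of the allocate/deallocate/pull lifecycle discussed in
    this thread. Method names mirror the ones quoted in the diff, but
    the implementation is purely illustrative."""

    def __init__(self):
        self._next_id = 0
        self._caches = {}           # cache_id -> payload
        self._pending_free = set()  # freed ids awaiting consumer pull

    def allocate_cache(self, payload):
        cache_id = self._next_id
        self._next_id += 1
        self._caches[cache_id] = payload
        return cache_id

    def deallocate_cache(self, cache_id):
        # Marks the cache_id as free; the actual release is deferred
        # until the consumer performs pull_cache.
        self._pending_free.add(cache_id)

    def pull_cache(self, cache_id):
        payload = self._caches[cache_id]
        if cache_id in self._pending_free:  # deferred release happens here
            del self._caches[cache_id]
            self._pending_free.discard(cache_id)
        return payload
```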
Why not use datadist's cache_manager? cache_manager also has allocate_cache. Is there any comparison between cache_manager and kv_cache_manager in transfer performance?
In fact, I do not have a deep understanding of llm_datadist. In the current version, the functions are streamlined based on kv_cache_manager. In the future, I will consider implementing with cache_manager and comparing the performance between them.
@LCAIZJ Would you be able to share any resources detailing how cache_manager differs from kv_cache_manager?
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
```python
# Free cache_id of buffer; the actual deallocation happens after the consumer performs pull_cache.
self.kv_transfer.deallocate_cache(buffer)
```
When the decode tensor parallelism size exceeds the prefill tensor parallelism size, the same buffer in a prefill node may receive multiple pull requests. However, there appears to be a potential issue: the buffer gets deallocated after the first pull request is processed. Could this be causing errors?
Indeed, the current version might fail in this situation. Maybe we can maintain a FIFO buffer list in simple_buffer.py.
Could you clarify your idea? For hybrid parallelism, I think we either need to manually manage buffer lifecycles or accept storage redundancy.
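One way to manage buffer lifecycles manually, as discussed above, is to count pulls per buffer and only deallocate once every expected consumer has pulled. A hedged sketch follows; the class, the `expected_pulls` parameter, and `on_pull_done` are hypothetical names for illustration, not part of llm_datadist or this PR:

```python
from collections import defaultdict

class PullCountingBuffer:
    """Hypothetical sketch: defer deallocation until all expected
    consumers have pulled. When decode TP exceeds prefill TP, one
    prefill buffer may serve expected_pulls = decode_tp // prefill_tp
    decode ranks, so freeing after the first pull is unsafe."""

    def __init__(self, kv_transfer, expected_pulls: int):
        self.kv_transfer = kv_transfer
        self.expected_pulls = expected_pulls
        self.pull_counts = defaultdict(int)

    def on_pull_done(self, buffer):
        # Called once per completed pull_cache from a decode rank.
        self.pull_counts[id(buffer)] += 1
        if self.pull_counts[id(buffer)] >= self.expected_pulls:
            # All consumers have pulled; now it is safe to free.
            self.kv_transfer.deallocate_cache(buffer)
            del self.pull_counts[id(buffer)]
```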
This PR also supports dummy runs in engine v0: when one DP rank receives a request, the other DP ranks also receive an idle request, so that every rank in the DP communication group reaches the same collective call and no rank blocks on it. However, this support is limited and is only valid in the three scenarios below:
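The dummy-run idea can be illustrated with a pure-Python sketch: if any DP rank has a real request for a step, every idle rank is padded with a dummy request so all ranks enter the same collectives. The function name and `"<dummy>"` placeholder are illustrative, not vLLM's actual API:

```python
def pad_with_dummy_requests(per_rank_batches):
    """Sketch of the dummy-run scheme for DP ranks.

    per_rank_batches: one list of requests per DP rank for this step.
    If no rank has work, all ranks skip the step; otherwise idle ranks
    get a dummy request so every rank reaches the same collective call.
    """
    if not any(per_rank_batches):
        return per_rank_batches  # no rank has work: all ranks skip
    return [batch if batch else ["<dummy>"] for batch in per_rank_batches]
```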
This disaggregated prefill feature is still in an experimental phase. Further testing and development are required; please update the tutorial and unit tests for this feature after this PR is merged @whx-sjtu.
### What this PR does / why we need it?
#### 1. Fix spec UT in vllm-ascend main and vllm main
As #694 and #749 verify, the spec UT passes with vllm-ascend main and vllm 0.8.5, but CI fails with vllm-ascend main and vllm main. I found the reason is a triton bug, triton-lang/triton#2266, but I could not figure out why the bug does not affect vllm-ascend main with vllm 0.8.5; perhaps the usage of triton changed between vllm 0.8.5 and the latest main. As the bug describes, I changed the minimum block_size in the UT from 8 to 16, and the modification was verified locally to be effective.
#### 2. Modify how some cases are skipped
I changed some commented-out cases to the skipif form, which is more standardized.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
CI
Signed-off-by: mengwei805 <mengwei25@huawei.com>
What this PR does / why we need it?
This PR proposes a P2P version of Disaggregated Prefill based on llm_datadist, which manages the data transfer.
This solution reworks the previous offline single-node Disaggregated Prefill solution and now supports multi-node and online serving.
Currently this solution supports the 1P1D case of Deepseek hybrid parallelism (P: TP+EP, D: DP+EP). Note that the xPyD case is considered in the solution design and will be supported soon within the v1 engine.