
[Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist#694

Merged
ganyi1996ppo merged 16 commits intovllm-project:mainfrom
whx-sjtu:p2p_pd_sep
May 1, 2025
Conversation

Collaborator

@whx-sjtu whx-sjtu commented Apr 28, 2025

What this PR does / why we need it?

  • This PR proposes a P2P version of Disaggregated Prefill based on llm_datadist, which manages the data transfer.

  • This solution reconstructs the previous offline single-node Disaggregated Prefill solution, and now supports multi-node deployment and online serving.

  • Currently this solution supports the 1P1D case of Deepseek hybrid parallelism (P: TP+EP, D: DP+EP). Note that the xPyD case is considered in the solution design and will be supported soon within the v1 engine.

Contributor

@wuhuikx wuhuikx left a comment


Can we have a README.md in examples/disaggregated_prefill to guide users through running this feature? It could include:

  1. disaggregated_prefill_offline.sh
  2. disaggregated_prefill_online.sh

@whx-sjtu force-pushed the p2p_pd_sep branch 5 times, most recently from e76fddc to ae658ea on April 30, 2025 12:04
options["ge.exec.deviceId"] = str(self.rank)
print(f"prepare datadist, options: {options}")
self.data_dist.init(options)
self.kv_transfer = self.data_dist.kv_cache_manager
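For context, the init flow in the snippet above can be mimicked with a small stub (a hypothetical sketch — `DataDistStub` and `KvCacheManagerStub` are stand-ins; the real llm_datadist API is not reproduced here):

```python
# Hypothetical stand-ins illustrating the init flow from the snippet above;
# the real llm_datadist objects are not reproduced here.
class KvCacheManagerStub:
    """Placeholder for llm_datadist's kv_cache_manager."""
    pass


class DataDistStub:
    def __init__(self):
        self.options = None
        self.kv_cache_manager = KvCacheManagerStub()

    def init(self, options):
        # The real library would set up device communication here;
        # the stub only records the options it was given.
        self.options = dict(options)


class Worker:
    def __init__(self, rank: int):
        self.rank = rank
        self.data_dist = DataDistStub()

    def prepare_datadist(self):
        # Mirrors the snippet: bind the datadist instance to this rank's
        # device, then keep a handle to its kv cache manager.
        options = {"ge.exec.deviceId": str(self.rank)}
        print(f"prepare datadist, options: {options}")
        self.data_dist.init(options)
        self.kv_transfer = self.data_dist.kv_cache_manager


w = Worker(rank=2)
w.prepare_datadist()
```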
Collaborator


Why not use datadist's cache_manager? cache_manager also has allocate_cache. Is there any comparison between cache_manager and kv_cache_manager in transfer performance?

Collaborator Author


In fact, I do not have a deep understanding of llm_datadist. In the current version, the functions are streamlined based on kv_cache_manager. In the future, I will consider implementing with cache_manager and comparing the performance of the two.

Collaborator


@LCAIZJ Would you be able to share any resources detailing how cache_manager differs from kv_cache_manager?

@whx-sjtu force-pushed the p2p_pd_sep branch 2 times, most recently from f57ae1c to 1007fa7 on May 1, 2025 05:55
@whx-sjtu force-pushed the p2p_pd_sep branch 2 times, most recently from 24f8d07 to 3975e7b on May 1, 2025 06:49
hw_whx and others added 10 commits May 1, 2025 14:50
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: hw_whx <wanghexiang7@huawei.com>
…e same path

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Comment on lines +182 to +183
# Free the cache_id of the buffer; the actual deallocation happens after the consumer performs pull_cache.
self.kv_transfer.deallocate_cache(buffer)
Collaborator


When the decode tensor parallelism size exceeds the prefill tensor parallelism size, the same buffer in a prefill node may receive multiple pull requests. However, there appears to be a potential issue: the buffer gets deallocated after the first pull request is processed. Could this be causing errors?

Collaborator Author


Indeed, the current version might fail in this situation. Maybe we can maintain a buffer list in simple_buffer.py with FIFO semantics.

Collaborator


Could you clarify your idea? For hybrid parallelism, I think we either need to manually manage buffer lifecycles or accept storage redundancy.
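For concreteness, one way the manual lifecycle management could look (a hypothetical sketch, not the PR's code — `BufferTracker`, its fields, and the buffer ids are illustrative): track how many pulls each prefill buffer should expect when decode TP exceeds prefill TP, and only free it after the last pull.

```python
import math
from collections import defaultdict


class BufferTracker:
    """Hypothetical sketch: delay deallocation until every decode rank
    that maps onto a prefill buffer has pulled it."""

    def __init__(self, prefill_tp: int, decode_tp: int):
        # When decode TP > prefill TP, several decode ranks pull the
        # same prefill buffer, so one buffer needs multiple pulls.
        self.pulls_needed = max(1, math.ceil(decode_tp / prefill_tp))
        self.pulls_seen = defaultdict(int)
        self.freed = []

    def on_pull(self, buffer_id: str) -> bool:
        """Record one pull; return True once the buffer can be freed."""
        self.pulls_seen[buffer_id] += 1
        if self.pulls_seen[buffer_id] >= self.pulls_needed:
            # The real deallocate_cache call would go here.
            self.freed.append(buffer_id)
            return True
        return False


tracker = BufferTracker(prefill_tp=2, decode_tp=4)
first = tracker.on_pull("buf0")   # first of two expected pulls: keep buffer
second = tracker.on_pull("buf0")  # second pull: now safe to free
```

The storage-redundancy alternative mentioned above would instead allocate one copy of the buffer per expected consumer, trading memory for simpler bookkeeping.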

@ganyi1996ppo
Collaborator

This PR also supports dummy run in engine v0: when one dp rank receives a request, the other dp ranks also receive an idle request, so that all ranks in the dp communication group reach the same collective call and no rank blocks on it.

However, this support is limited and is only valid in the three scenarios below:

  • When the target model does not perform collective calls over the dp group, the service is expected to work at any time.
  • When the target model may enter both the decode and prefill phases (the normal vllm serve case) and needs to perform collective calls over the dp group, the prefill and decode paths must always issue the same collective calls within the dp group, at the same positions.
  • If the prefill and decode paths issue different collective calls over the same dp group inside the target model, then only disaggregated prefill and decode can guarantee functionality (under the strong assumption that the kv_cache and hidden_states are always properly received for a valid request that has already been processed by the prefill node).
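The alignment constraint behind the dummy-run idea can be illustrated with a toy model (hypothetical sketch; real dp groups issue device collectives, not strings):

```python
# Toy model of the dummy-run idea: every dp rank must enter the same
# sequence of collective calls, so ranks without a real request run an
# idle (dummy) step that issues the identical collectives.
def step(rank_has_request: list) -> list:
    """Return the collective-call trace each dp rank would execute."""
    traces = []
    for has_req in rank_has_request:
        kind = "real" if has_req else "dummy"
        # Both real and dummy paths must issue the same collectives in
        # the same order; otherwise some rank blocks forever waiting
        # for peers that never enter the call.
        traces.append([f"{kind}:all_reduce", f"{kind}:all_gather"])
    return traces


traces = step([True, False, False, True])
# Strip the real/dummy tag: every rank's collective sequence matches,
# which is the invariant the dummy run exists to preserve.
calls = [[t.split(":")[1] for t in tr] for tr in traces]
```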

…om during e2e test

Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
@ganyi1996ppo ganyi1996ppo changed the title [Disaggregated Prefill][WIP] P2P Disaggregated Prefill based on llm_datadist [Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist May 1, 2025
@ganyi1996ppo
Collaborator

This disaggregated prefill feature is still in an experimental phase. Further testing and development are required; please update the tutorial and unit tests for this feature after this PR is merged @whx-sjtu .

@ganyi1996ppo ganyi1996ppo merged commit 8b194ad into vllm-project:main May 1, 2025
16 checks passed
wangxiyuan pushed a commit that referenced this pull request May 10, 2025
### What this PR does / why we need it?
#### 1. Fix the spec UT in vllm-ascend main and vllm main
As #694 and
#749 verify, the spec UT now passes with vllm-ascend main and vllm 0.8.5, but CI fails with vllm-ascend main and vllm main.

I found the reason to be a triton bug
(triton-lang/triton#2266), but I didn't figure out why the bug does not affect vllm-ascend main with vllm 0.8.5; maybe the usage of triton changed between vllm 0.8.5 and the latest main.

As the bug describes, I changed the minimum block_size in the UT from 8 to 16, and the modification was verified locally to be effective.

#### 2. Modify the skip form of some cases.
I changed some commented-out cases to skipif form, which is more standardized.

### Does this PR introduce _any_ user-facing change?
None

### How was this patch tested?
CI

Signed-off-by: mengwei805 <mengwei25@huawei.com>
@MengqingCao MengqingCao mentioned this pull request May 14, 2025
13 tasks
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Oct 16, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025