[P/D] Add readme for PD separation#4182
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds a comprehensive README for prefill/decode (PD) separation. The documentation is detailed, but there are a few areas that need improvement for clarity and correctness. I've identified a duplicated command-line argument, some incorrect parameters in a table, a potentially broken link, and inconsistencies in configuration values. Additionally, the document structure could be improved to better explain the different configuration options presented. Addressing these points will significantly enhance the usability of this guide.
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
--port 1999 \
--host 192.0.0.1 \
| Parameter | Description |
| --- | --- |
| --prefiller-hosts-num | Number of repetitions for prefiller node hosts |
| --prefiller-ports | Ports of prefiller nodes |
| --prefiller-ports-inc | Number of increments for prefiller node ports |
| --decoder-hosts | Hosts of decoder nodes |
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |
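For context, a hypothetical invocation using the core arguments from this table might look as follows. This is a sketch only: flag spellings and which flags the referenced script actually accepts are assumptions and should be verified against the script's `--help` output.

```shell
# Hypothetical proxy invocation (flag names assumed from the table above; verify with --help)
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 1999 \
  --prefiller-hosts 192.0.0.1 192.0.0.2 \
  --prefiller-ports 8004 \
  --decoder-hosts 192.0.0.3 192.0.0.4 \
  --decoder-ports 8004
```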
The parameters --prefiller-hosts-num, --prefiller-ports-inc, --decoder-hosts-num, and --decoder-ports-inc described in this table do not appear to be supported by the load_balance_proxy_server_example.py script referenced later. This can be misleading for users. Please ensure the documentation accurately reflects the script's arguments.
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |

You can get the proxy program in the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
This link points to a specific development branch (v0.9.1-dev) and an incorrect directory name (disaggregate_prefill_v1). This makes the link fragile and likely to break. It should be updated to point to the main branch and use the correct path for consistency with other links in this file.
Before: You can get the proxy program in the repository's examples, [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
After: You can get the proxy program in the repository's examples, [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
## Prefill & Decode Configuration Details

In the PD separation scenario, we provide an optimized configuration.
- **decoder node**
  1. Set `HCCL_BUFFSIZE=1024`
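A minimal sketch of what this step looks like in a decoder launch script. Only the `HCCL_BUFFSIZE=1024` value comes from the text; everything else is a placeholder.

```shell
#!/usr/bin/env bash
# Decoder-node sketch: enlarge the HCCL buffer before launching the decode instance.
export HCCL_BUFFSIZE=1024   # MB; value recommended by the configuration above
# ... launch the decoder service here (actual flags are deployment-specific) ...
```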
leo-pony left a comment:
Please don't remove the llmdatadist guide, as it is also needed by A5 before 2026.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from `kv_port` to `kv_port + num_chips` to initialize socket listeners, so port conflicts must be prevented. Additionally, ensure that each node's `engine_id` is uniquely assigned to avoid conflicts.
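The port-occupancy rule above can be sketched as a small check. The helper names are hypothetical; `kv_port` and `num_chips` are the quantities named in the text.

```python
def kv_port_range(kv_port: int, num_chips: int) -> range:
    """Ports a P/D node occupies for its socket listeners:
    kv_port up to kv_port + num_chips, as described above."""
    return range(kv_port, kv_port + num_chips)

def ports_conflict(kv_port_a: int, kv_port_b: int, num_chips: int) -> bool:
    """True if two nodes' listener port ranges overlap."""
    a = kv_port_range(kv_port_a, num_chips)
    b = kv_port_range(kv_port_b, num_chips)
    return max(a.start, b.start) < min(a.stop, b.stop)

# Two 16-chip nodes whose kv_ports are 16 apart do not conflict; closer than that, they do.
assert ports_conflict(20000, 20008, 16)
assert not ports_conflict(20000, 20016, 16)
```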
### Layerwise
### launch_online_dp.py
Refer to the script in the examples directory; don't just put all the code in the doc.
OK, we now refer to the script in the examples directory.
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Mooncake and llmdatadist will work with A5 at the same time.
Signed-off-by: liziyu <liziyu16@huawei.com>
## Example Proxy for Deployment
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load_balance_proxy_layerwise_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Let's add a description of the difference between the layerwise proxy and the non-layerwise one.
The layerwise proxy is designed for the MooncakeLayerwiseConnector, which routes inference requests to the P-node as the initial processing point. Conversely, the non-layerwise proxy is designed for the MooncakeConnector, which directs inference requests to the D-node first.
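The routing distinction described above can be summarized in a tiny sketch. The function name is hypothetical; the connector class names are those mentioned in the reply.

```python
def first_hop(connector: str) -> str:
    """Which node type receives an inference request first, per connector type
    (per the description above; this mapping is an illustrative sketch)."""
    routes = {
        "MooncakeLayerwiseConnector": "prefill",  # layerwise proxy: P-node first
        "MooncakeConnector": "decode",            # non-layerwise proxy: D-node first
    }
    try:
        return routes[connector]
    except KeyError:
        raise ValueError(f"unknown connector: {connector}")
```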
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Add readme for PD separation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@2918c1b

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
CANN 8.3.RC2 now supports transit transmission; use `export ASCEND_BUFFER_POOL=4:8` to enable it and `export ASCEND_BUFFER_POOL=0:0` to disable it.
The environment variable `ASCEND_TRANSFER_TIMEOUT` sets the transmission timeout period; it should be included in the instructions.
A dedicated section should describe the relevant environment variable configurations, such as `HCCL_INTRA_ROCE_ENABLE`.
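Collecting the variables raised in this thread, such a section might sketch out as follows. The variable names come from the discussion above, but the specific values are illustrative assumptions and should be verified against the official CANN documentation for your version.

```shell
# Environment variables relevant to PD separation (values are illustrative assumptions)
export HCCL_BUFFSIZE=1024            # HCCL buffer size in MB, enlarged for KV transfer
export HCCL_OP_EXPANSION_MODE="AIV"  # HCCL operator expansion mode
export ASCEND_BUFFER_POOL=4:8        # CANN 8.3.RC2+: enable transit transmission (0:0 disables)
export ASCEND_TRANSFER_TIMEOUT=120   # transmission timeout; value here is an assumption
# export HCCL_INTRA_ROCE_ENABLE=1    # example of a variable worth documenting (assumption)
```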
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
The Mooncake connector now supports PCP/DCP; the tutorial should include corresponding examples.
--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
For the Mooncake connector, to achieve optimal performance, we should add the `--async-scheduling` option to enable asynchronous scheduling in the sample script.
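A hedged sketch of how the suggestion could look in a launch command. The model path, parallel size, and other flags are placeholders carried over from the excerpts in this thread, not verified values; only `--async-scheduling` is the flag the reviewer proposes.

```shell
# Sketch only: illustrative launch with the suggested flag (other values are placeholders)
vllm serve /path/to/model \
  --served-model-name ds_r1 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --async-scheduling   # reviewer's suggestion: enable asynchronous scheduling for Mooncake
```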
Modify `run_dp_template.sh` on each node.
[run_dp_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
#### Layerwise
Considering the current situation, should we introduce the Mooncake connector first and mark layerwise as experimental support?
-it $IMAGE bash
```
## Install Mooncake
We should show the Mooncake commit ID at the beginning of the guide.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide provides step-by-step instructions to verify these features with constrained resources.
Before: Take the Qwen3-235B model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
After: Take the Deepseek-r1-w8a8 model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
We should add a table to display recommended parallel configurations for common setups such as 1P1D and 2P1D.