
[P/D] Add readme for PD separation #4182

Merged
wangxiyuan merged 8 commits into vllm-project:main from wangxiaoteng888:add_distribute_readme
Nov 28, 2025
Conversation

@wangxiaoteng888 (Contributor) commented Nov 13, 2025

What this PR does / why we need it?

Add readme for PD separation

Does this PR introduce any user-facing change?

No

How was this patch tested?

By ci

@github-actions bot commented:

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 13, 2025
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds a comprehensive README for prefill/decode (PD) separation. The documentation is detailed, but there are a few areas that need improvement for clarity and correctness. I've identified a duplicated command-line argument, some incorrect parameters in a table, a potentially broken link, and inconsistencies in configuration values. Additionally, the document structure could be improved to better explain the different configuration options presented. Addressing these points will significantly enhance the usability of this guide.

--decoder-hosts 192.0.0.3\
--decoder-ports 8004
--port 1999 \
--host 192.0.0.1 \


high

The --host argument is duplicated. It's already specified on line 972. Please remove this redundant line to avoid confusion.

Comment on lines +1086 to +1092
| Parameter | Description |
| --- | --- |
| --prefiller-hosts-num | Number of repetitions for prefiller node hosts |
| --prefiller-ports | Ports of prefiller nodes |
| --prefiller-ports-inc | Number of increments for prefiller node ports |
| --decoder-hosts | Hosts of decoder nodes |
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |
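For reference, a hypothetical proxy invocation restricted to the arguments the review indicates the example script does accept (the `*-num`/`*-inc` flags are excluded); the host/port values are placeholders for the 2P1D topology described elsewhere in the doc, so this is a non-runnable sketch rather than a verified command:

```shell
# Hypothetical launch of the load-balancing proxy; IPs/ports are placeholders.
python load_balance_proxy_server_example.py \
    --host 192.0.0.1 \
    --port 1999 \
    --prefiller-hosts 192.0.0.1 192.0.0.2 \
    --prefiller-ports 8001 8002 \
    --decoder-hosts 192.0.0.3 \
    --decoder-ports 8004
```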


high

The parameters --prefiller-hosts-num, --prefiller-ports-inc, --decoder-hosts-num, and --decoder-ports-inc described in this table do not appear to be supported by the load_balance_proxy_server_example.py script referenced later. This can be misleading for users. Please ensure the documentation accurately reflects the script's arguments.

You can get the proxy program in the repository's examples, [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)


high

This link points to a specific development branch (v0.9.1-dev) and an incorrect directory name (disaggregate_prefill_v1). This makes the link fragile and likely to break. It should be updated to point to the main branch and use the correct path for consistency with other links in this file.

Suggested change
You can get the proxy program in the repository's examples, [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
You can get the proxy program in the repository's examples, [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

Comment on lines +1149 to +1151
## Prefill & Decode Configuration Details

In the PD separation scenario, we provide an optimized configuration.


high

This section introduces a new configuration using LLMDataDistCMgrConnector without explaining how it relates to the Mooncake...Connector configurations detailed earlier in the document. This makes the documentation confusing. Please add context to clarify when each configuration should be used.


- **decoder node**

1. set HCCL_BUFFSIZE=1024


high

There's an inconsistency in the recommended HCCL_BUFFSIZE for the decoder node. Here it is recommended to be 1024, but the example script on line 418 sets it to 600. Please resolve this discrepancy to avoid confusion.
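Whichever value the maintainers settle on, the buffer size is set per node before launching the service. A minimal sketch, assuming the prose recommendation of 1024 (MB) for the decoder node rather than the 600 used in the example script — this value is an assumption pending resolution of the discrepancy:

```shell
# Assumed value pending resolution of the 600-vs-1024 discrepancy flagged above.
# HCCL_BUFFSIZE controls the HCCL communication buffer size for this process.
export HCCL_BUFFSIZE=1024   # decoder node recommendation from the prose
echo "decoder HCCL_BUFFSIZE=${HCCL_BUFFSIZE}"
```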


@leo-pony leo-pony left a comment


Please don't remove the llmdatadist guide, as it is also needed by A5 before 2026.

We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from kv_port to kv_port + num_chips to initialize socket listeners. To avoid any issues, port conflicts should be prevented. Additionally, ensure that each node's engine_id is uniquely assigned to avoid conflicts.

### Layerwise
### launch_online_dp.py


Refer to the script in the examples; don't just put all the code in the doc.

Contributor Author


OK, we now refer to the script in the examples.

wangxiaoteng888 and others added 4 commits November 24, 2025 10:38
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
@liziyu179 liziyu179 force-pushed the add_distribute_readme branch 7 times, most recently from 3ac2754 to 189c704 Compare November 24, 2025 09:37
@liziyu179

please don't remove the llmdatadist guide, as it is also needed by A5 before 2026

Mooncake and llmdatadist will work with A5 at the same time.

@liziyu179 liziyu179 force-pushed the add_distribute_readme branch 4 times, most recently from c691f51 to b5fe640 Compare November 24, 2025 13:14
Signed-off-by: liziyu <liziyu16@huawei.com>
@liziyu179 liziyu179 force-pushed the add_distribute_readme branch from b5fe640 to f29e17d Compare November 24, 2025 13:47
Signed-off-by: liziyu <liziyu16@huawei.com>

## Example Proxy for Deployment

Run a proxy server on the same node with the prefiller service instance. You can get the proxy program in the repository's examples: [load\_balance\_proxy\_layerwise\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load\_balance\_proxy\_server\_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)


Let's describe the difference between the layerwise proxy and the other one.



The layerwise proxy is engineered for the MooncakeLayerwiseConnector, which is configured to route inference requests to the P-node as the initial processing point.
Conversely, the non-layerwise proxy is designed for the MooncakeConnector, which directs inference requests to the D-node first.

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
@wangxiyuan wangxiyuan merged commit 366d2d9 into vllm-project:main Nov 28, 2025
17 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?
Add readme for PD separation

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main:
vllm-project/vllm@2918c1b

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025
(Each referencing commit carries the same PR description as above.)
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH


CANN 8.3.RC2 now supports transit transmission; use `export ASCEND_BUFFER_POOL=4:8` to enable it, and `export ASCEND_BUFFER_POOL=0:0` to disable it.
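As a sketch, the toggle from the reviewer's note could be added alongside the other exports in the launch script; the variable name and values come from the comment above, and their semantics are otherwise an assumption:

```shell
# Transit transmission toggle (CANN 8.3.RC2+), per the reviewer's note above.
export ASCEND_BUFFER_POOL=4:8    # enable transit transmission
# export ASCEND_BUFFER_POOL=0:0  # set this instead to disable it
echo "ASCEND_BUFFER_POOL=${ASCEND_BUFFER_POOL}"
```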



The environment variable ASCEND_TRANSFER_TIMEOUT sets the transmission timeout period, which should be included in the instructions.



A dedicated section should describe the relevant environment variable configurations, such as HCCL_INTRA_ROCE_ENABLE.
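Such a section might consolidate the variables raised across these review comments into one place in the launch script. The variable names below are taken from the comments; the values shown are illustrative assumptions only, not tested recommendations:

```shell
# Illustrative consolidated environment section; values are assumptions.
export HCCL_INTRA_ROCE_ENABLE=1     # RoCE usage within a node (illustrative)
export ASCEND_TRANSFER_TIMEOUT=30   # transmission timeout period (illustrative)
export ASCEND_BUFFER_POOL=4:8       # transit transmission (CANN 8.3.RC2+)
env | grep -E 'HCCL_INTRA_ROCE_ENABLE|ASCEND_TRANSFER_TIMEOUT|ASCEND_BUFFER_POOL'
```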

--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \


Mooncake connector now supports PCP/DCP, I think the tutorial should include corresponding examples.

--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \


For the Mooncake connector, to achieve optimal performance, we should add the --async-scheduling option to the sample scripts to enable asynchronous scheduling.
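A non-runnable sketch of how the flag would slot into the serve command quoted above; the model path is a placeholder and the other flags are as in the surrounding sample:

```shell
# Hypothetical excerpt: same flags as the sample script, plus async scheduling.
vllm serve /path/to/model \
    --gpu-memory-utilization 0.9 \
    --quantization ascend \
    --no-enable-prefix-caching \
    --async-scheduling
```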

Modify `run_dp_template.sh` on each node:
[run\_dp\_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)

#### Layerwise


Considering the current situation, should we first introduce the mooncake connector and mark layerwise as experimental support?

-it $IMAGE bash
```

## Install Mooncake


We should show the mooncake commit id at the beginning of the guide.

vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide walks step by step through verifying these features with constrained resources.

Take the Qwen3-235B model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
Take the Deepseek-r1-w8a8 model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.


We should add a table to display recommended parallel configurations for common setups such as 1P1D and 2P1D.
