[P/D] Add readme for PD separation#4182
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds a comprehensive README for prefill/decode (PD) separation. The documentation is detailed, but there are a few areas that need improvement for clarity and correctness. I've identified a duplicated command-line argument, some incorrect parameters in a table, a potentially broken link, and inconsistencies in configuration values. Additionally, the document structure could be improved to better explain the different configuration options presented. Addressing these points will significantly enhance the usability of this guide.
--decoder-hosts 192.0.0.3 \
--decoder-ports 8004
--port 1999 \
--host 192.0.0.1 \
| Parameter | Description |
| --- | --- |
| --prefiller-hosts-num | Number of repetitions for prefiller node hosts |
| --prefiller-ports | Ports of prefiller nodes |
| --prefiller-ports-inc | Number of increments for prefiller node ports |
| --decoder-hosts | Hosts of decoder nodes |
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |
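For context, a hypothetical invocation using the core arguments from this table might look as follows. This is a sketch only: flag spellings and which flags the referenced script actually accepts are assumptions and should be verified against the script's `--help` output.

```shell
# Hypothetical proxy invocation (flag names assumed from the table above; verify with --help)
python load_balance_proxy_server_example.py \
  --host 192.0.0.1 \
  --port 1999 \
  --prefiller-hosts 192.0.0.1 192.0.0.2 \
  --prefiller-ports 8004 \
  --decoder-hosts 192.0.0.3 192.0.0.4 \
  --decoder-ports 8004
```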
The parameters --prefiller-hosts-num, --prefiller-ports-inc, --decoder-hosts-num, and --decoder-ports-inc described in this table do not appear to be supported by the load_balance_proxy_server_example.py script referenced later. This can be misleading for users. Please ensure the documentation accurately reflects the script's arguments.
| --decoder-hosts-num | Number of repetitions for decoder node hosts |
| --decoder-ports | Ports of decoder nodes |
| --decoder-ports-inc | Number of increments for decoder node ports |

You can get the proxy program in the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
This link points to a specific development branch (v0.9.1-dev) and an incorrect directory name (disaggregate_prefill_v1). This makes the link fragile and likely to break. It should be updated to point to the main branch and use the correct path for consistency with other links in this file.
Before: You can get the proxy program in the repository's examples, [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/v0.9.1-dev/examples/disaggregate_prefill_v1/load_balance_proxy_server_example.py)
After: You can get the proxy program in the repository's examples, [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
## Prefill & Decode Configuration Details

In the PD separation scenario, we provide an optimized configuration.
- **decoder node**
  1. Set `HCCL_BUFFSIZE=1024`
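A minimal sketch of what this step looks like in a decoder launch script. Only the `HCCL_BUFFSIZE=1024` value comes from the text; everything else is a placeholder.

```shell
#!/usr/bin/env bash
# Decoder-node sketch: enlarge the HCCL buffer before launching the decode instance.
export HCCL_BUFFSIZE=1024   # MB; value recommended by the configuration above
# ... launch the decoder service here (actual flags are deployment-specific) ...
```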
leo-pony left a comment:
Please don't remove the llmdatadist guide, as it is also needed by A5 before 2026.
We can run the following scripts to launch a server on the prefiller/decoder node, respectively. Please note that each P/D node will occupy ports ranging from `kv_port` to `kv_port + num_chips` to initialize socket listeners, so port conflicts must be prevented. Additionally, ensure that each node's `engine_id` is uniquely assigned to avoid conflicts.
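The port-occupancy rule above can be sketched as a small check. The helper names are hypothetical; `kv_port` and `num_chips` are the quantities named in the text.

```python
def kv_port_range(kv_port: int, num_chips: int) -> range:
    """Ports a P/D node occupies for its socket listeners:
    kv_port up to kv_port + num_chips, as described above."""
    return range(kv_port, kv_port + num_chips)

def ports_conflict(kv_port_a: int, kv_port_b: int, num_chips: int) -> bool:
    """True if two nodes' listener port ranges overlap."""
    a = kv_port_range(kv_port_a, num_chips)
    b = kv_port_range(kv_port_b, num_chips)
    return max(a.start, b.start) < min(a.stop, b.stop)

# Two 16-chip nodes whose kv_ports are 16 apart do not conflict; closer than that, they do.
assert ports_conflict(20000, 20008, 16)
assert not ports_conflict(20000, 20016, 16)
```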
### Layerwise
### launch_online_dp.py
Refer to the script in the examples directory; don't just put all the code in the doc.
OK, we now refer to the script in the examples directory.
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Mooncake and llmdatadist will work with A5 at the same time.
Signed-off-by: liziyu <liziyu16@huawei.com>
## Example Proxy for Deployment
Run a proxy server on the same node as the prefiller service instance. You can get the proxy program in the repository's examples: [load_balance_proxy_layerwise_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_layerwise_server_example.py) or [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)
Let's add a description of the difference between the layerwise proxy and the non-layerwise one.
The layerwise proxy is designed for the MooncakeLayerwiseConnector, which routes inference requests to the P-node as the initial processing point. Conversely, the non-layerwise proxy is designed for the MooncakeConnector, which directs inference requests to the D-node first.
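The routing distinction described above can be summarized in a tiny sketch. The function name is hypothetical; the connector class names are those mentioned in the reply.

```python
def first_hop(connector: str) -> str:
    """Which node type receives an inference request first, per connector type
    (per the description above; this mapping is an illustrative sketch)."""
    routes = {
        "MooncakeLayerwiseConnector": "prefill",  # layerwise proxy: P-node first
        "MooncakeConnector": "decode",            # non-layerwise proxy: D-node first
    }
    try:
        return routes[connector]
    except KeyError:
        raise ValueError(f"unknown connector: {connector}")
```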
Signed-off-by: liziyu <liziyu16@huawei.com>
### What this PR does / why we need it?
Add readme for PD separation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@2918c1b

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_USE_V1=1
export ASCEND_RT_VISIBLE_DEVICES=$1
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/mooncake:$LD_LIBRARY_PATH
CANN 8.3.RC2 now supports transit transmission; use `export ASCEND_BUFFER_POOL=4:8` to enable it and `export ASCEND_BUFFER_POOL=0:0` to disable it.
The environment variable `ASCEND_TRANSFER_TIMEOUT` sets the transmission timeout period; it should be included in the instructions.
A dedicated section should describe the relevant environment variable configurations, such as `HCCL_INTRA_ROCE_ENABLE`.
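Collecting the variables raised in this thread, such a section might sketch out as follows. The variable names come from the discussion above, but the specific values are illustrative assumptions and should be verified against the official CANN documentation for your version.

```shell
# Environment variables relevant to PD separation (values are illustrative assumptions)
export HCCL_BUFFSIZE=1024            # HCCL buffer size in MB, enlarged for KV transfer
export HCCL_OP_EXPANSION_MODE="AIV"  # HCCL operator expansion mode
export ASCEND_BUFFER_POOL=4:8        # CANN 8.3.RC2+: enable transit transmission (0:0 disables)
export ASCEND_TRANSFER_TIMEOUT=120   # transmission timeout; value here is an assumption
# export HCCL_INTRA_ROCE_ENABLE=1    # example of a variable worth documenting (assumption)
```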
--tensor-parallel-size $7 \
--enable-expert-parallel \
--seed 1024 \
--served-model-name ds_r1 \
The Mooncake connector now supports PCP/DCP; the tutorial should include corresponding examples.
--gpu-memory-utilization 0.9 \
--gpu-memory-utilization 0.9 \
--quantization ascend \
--no-enable-prefix-caching \
For the Mooncake connector, to achieve optimal performance, we should add the `--async-scheduling` option to enable asynchronous scheduling in the sample script.
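A hedged sketch of how the suggestion could look in a launch command. The model path, parallel size, and other flags are placeholders carried over from the excerpts in this thread, not verified values; only `--async-scheduling` is the flag the reviewer proposes.

```shell
# Sketch only: illustrative launch with the suggested flag (other values are placeholders)
vllm serve /path/to/model \
  --served-model-name ds_r1 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --async-scheduling   # reviewer's suggestion: enable asynchronous scheduling for Mooncake
```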
Modify `run_dp_template.sh` on each node.
[run_dp_template.sh](https://github.com/vllm-project/vllm-ascend/blob/main/examples/external_online_dp/run_dp_template.sh)
#### Layerwise
Considering the current situation, should we introduce the Mooncake connector first and mark layerwise as experimental support?
-it $IMAGE bash
```
## Install Mooncake
We should show the Mooncake commit ID at the beginning of the guide.
vLLM-Ascend now supports prefill-decode (PD) disaggregation with EP (Expert Parallel) options. This guide provides step-by-step instructions to verify these features with constrained resources.
Before: Take the Qwen3-235B model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
After: Take the Deepseek-r1-w8a8 model as an example: use 4 Atlas 800T A3 servers to deploy the "2P1D" architecture. Assume the IPs of the prefiller servers are 192.0.0.1 (prefill 1) and 192.0.0.2 (prefill 2), and the decoder servers are 192.0.0.3 (decoder 1) and 192.0.0.4 (decoder 2). On each server, use 8 NPUs (16 chips) to deploy one service instance.
We should add a table to display recommended parallel configurations for common setups such as 1P1D and 2P1D.