[Doc] add long_sequence feature user guide by LookAround0301 · Pull Request #5343 · vllm-project/vllm-ascend

LookAround0301 · 2025-12-25T02:44:16Z

What this PR does / why we need it?

add long_sequence feature user guide

Does this PR introduce any user-facing change?

How was this patch tested?

vLLM version: release/v0.13.0
vLLM main: vllm-project/vllm@bc0a5a0

Signed-off-by: LookAround <lixushi@huawei.com>

gemini-code-assist

Code Review

This pull request adds a new user guide for the long-sequence context parallel feature. My review focuses on ensuring the instructions in the guide are correct, complete, and will not lead to user error. I have identified several high-severity issues, including an incorrect Docker command that doesn't match the specified hardware, a performance benchmark command that doesn't test the feature being introduced, and missing information required to understand a key configuration constraint. Addressing these points is crucial for the guide to be accurate and useful for users.

gemini-code-assist · 2025-12-25T02:46:20Z

+  docker run --rm \
+  --name $NAME \
+  --net=host \
+  --shm-size=1g \
+  --device /dev/davinci0 \
+  --device /dev/davinci1 \
+  --device /dev/davinci2 \
+  --device /dev/davinci3 \
+  --device /dev/davinci4 \
+  --device /dev/davinci5 \
+  --device /dev/davinci6 \
+  --device /dev/davinci7 \
+  --device /dev/davinci_manager \
+  --device /dev/devmm_svm \
+  --device /dev/hisi_hdc \
+  -v /usr/local/dcmi:/usr/local/dcmi \
+  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+  -v /etc/ascend_install.info:/etc/ascend_install.info \
+  -it $IMAGE bash


The guide is for an Atlas 800 A3 server, which has 16 NPUs. However, the docker run command only maps devices /dev/davinci0 through /dev/davinci7, which is appropriate for an 8-NPU machine (like Atlas A2). This is inconsistent with the guide's context and will lead to underutilization of the hardware and user confusion. The command should be updated to include all 16 devices for an Atlas A3 server, as indicated in the comment on line 28.

docker run --rm \
  --name $NAME \
  --net=host \
  --shm-size=1g \
  --device /dev/davinci0 \
  --device /dev/davinci1 \
  --device /dev/davinci2 \
  --device /dev/davinci3 \
  --device /dev/davinci4 \
  --device /dev/davinci5 \
  --device /dev/davinci6 \
  --device /dev/davinci7 \
  --device /dev/davinci8 \
  --device /dev/davinci9 \
  --device /dev/davinci10 \
  --device /dev/davinci11 \
  --device /dev/davinci12 \
  --device /dev/davinci13 \
  --device /dev/davinci14 \
  --device /dev/davinci15 \
  --device /dev/davinci_manager \
  --device /dev/devmm_svm \
  --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -it $IMAGE bash
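Instead of writing sixteen `--device` flags by hand, the flags could be generated from the NPU count. This is a minimal sketch, assuming the NPUs are exposed contiguously as `/dev/davinci0` through `/dev/davinci(N-1)`; `davinci_device_flags` is a hypothetical helper name and is not part of Docker or vLLM-Ascend.

```shell
# Hypothetical helper: print one "--device /dev/davinciN" flag per NPU.
# Assumes NPUs are numbered contiguously starting from /dev/davinci0.
davinci_device_flags() {
  count=$1
  i=0
  flags=""
  while [ "$i" -lt "$count" ]; do
    flags="$flags --device /dev/davinci$i"
    i=$((i + 1))
  done
  # Strip the leading space before printing.
  printf '%s\n' "${flags# }"
}

# Show the flags for a 16-NPU server.
davinci_device_flags 16
```

The result can be spliced into the `docker run` command as `$(davinci_device_flags 16)`, left unquoted so the shell splits it into separate arguments.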

gemini-code-assist · 2025-12-25T02:46:20Z

+
+**Notice:**
+- tp_size needs to be divisible by dcp_size
+- decode context parallel size must less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads.


The guide provides an important constraint: decode context parallel size must less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads. However, the value of total_num_kv_heads for the example model Qwen3-235B-A22B-w8a8 is not mentioned. This prevents users from verifying that the example command is correct and makes it difficult for them to adapt the guide for other models. Please provide the value of total_num_kv_heads for this model to ensure the constraint is understandable and verifiable.
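The two rules quoted above can be checked mechanically before launching a server. The sketch below implements them exactly as the guide words them; `check_dcp_size` is a hypothetical helper, and the KV-head count passed in the example call is an assumed value for illustration, since the thread does not state it. The real value should be read from the model's configuration.

```shell
# Hypothetical helper: validate a decode-context-parallel (DCP) size against
# the two rules quoted from the guide. Returns 0 if valid, 1 otherwise.
check_dcp_size() {
  tp_size=$1        # value passed as --tensor-parallel-size
  num_kv_heads=$2   # total_num_kv_heads, taken from the model config
  dcp_size=$3       # value passed as --decode-context-parallel-size

  # Rule 1: tp_size must be divisible by dcp_size.
  if [ $((tp_size % dcp_size)) -ne 0 ]; then
    echo "invalid: tp_size=$tp_size is not divisible by dcp_size=$dcp_size"
    return 1
  fi

  # Rule 2: dcp_size <= max_dcp_size = tp_size // total_num_kv_heads.
  max_dcp_size=$((tp_size / num_kv_heads))
  if [ "$dcp_size" -gt "$max_dcp_size" ]; then
    echo "invalid: dcp_size=$dcp_size exceeds max_dcp_size=$max_dcp_size"
    return 1
  fi

  echo "ok: dcp_size=$dcp_size (max_dcp_size=$max_dcp_size)"
}

# Illustrative values only: 8-way tensor parallelism, an ASSUMED 4 KV heads.
check_dcp_size 8 4 2
```

With these illustrative inputs, `max_dcp_size` is 8 // 4 = 2, so a DCP size of 2 passes both rules while a DCP size of 4 would be rejected by the second rule.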

gemini-code-assist · 2025-12-25T02:46:20Z

+
+```shell
+export VLLM_USE_MODELSCOPE=true
+vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8  --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./


The vllm bench serve command provided for performance evaluation does not include the long-sequence context parallel parameters (--tensor-parallel-size, --prefill-context-parallel-size, etc.) that are the focus of this guide. As a result, it benchmarks a default configuration instead of the specific feature being explained. To provide a relevant performance measurement, the benchmark command should mirror the deployment command's configuration.

Suggested change

vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./

vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./ --tensor-parallel-size 8 --prefill-context-parallel-size 2 --decode-context-parallel-size 2 --quantization ascend --max-model-len 133000 --enable-expert-parallel --trust-remote-code --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'

github-actions · 2025-12-25T03:42:57Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Signed-off-by: LookAround <lixushi@huawei.com>

LookAround0301 · 2025-12-26T01:44:13Z

+
+vLLM-Ascend now supports long-sequence context parallel. This guide takes one-by-one steps to verify these features with constrained resources.
+
+Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use vllm-ascend:0.12.0rc2 (with vLLM v0.13.0) 1 Atlas 800 A3 (64G × 16) server to deploy the single node "long sequence" architecture.


Update this to support the mixed-deployment (co-located) architecture.

Signed-off-by: LookAround <lixushi@huawei.com>

…to eplb_refactor

* 'main' of https://github.com/vllm-project/vllm-ascend: (46 commits)
  [Feature] Support to use fullgraph with eagle (vllm-project#5118)
  [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy（depend on pr5285） (vllm-project#5311)
  [Refactor]6/N Extract common code of class AscendMLAImpl (vllm-project#5314)
  [Refactor] cache cos/sin in mla & remove parameter model in builder. (vllm-project#5277)
  update vllm pin to 12.27 (vllm-project#5412)
  [ReleaseNote] Add release note for v0.13.0rc1 (vllm-project#5334)
  [Bugfix] Correctly handle the output shape in multimodal attention (vllm-project#5443)
  Fix nightly (vllm-project#5413)
  [bugfix] fix typo of _skip_all_reduce_across_dp_group (vllm-project#5435)
  [Doc]modify pcp tutorial doc (vllm-project#5440)
  [Misc] fast fail for exiting if tools/install_flash_infer_attention_score_ops_a2.sh (vllm-project#5422)
  [Doc] Update DeepSeek V3.1/R1 2P1D doc (vllm-project#5387)
  [DOC]Fix model weight download links (vllm-project#5436)
  [Doc] Modify DeepSeek-R1/V3.1 documentation (vllm-project#5426)
  Revert "[feat] enable hierarchical mc2 ops on A2 by default (vllm-project#5300)" (vllm-project#5434)
  [Bugfix] fix greedy temperature detection (vllm-project#5417)
  [doc] Update Qwen3-235B doc for reproducing latest performance (vllm-project#5323)
  [feat] enable hierarchical mc2 ops on A2 by default (vllm-project#5300)
  [Doc] delete environment variable HCCL_OP_EXPANSION_MODE in DeepSeekV3.1/R1 (vllm-project#5419)
  [Doc] add long_sequence feature user guide (vllm-project#5343)
  ...

### What this PR does / why we need it? add long_sequence feature user guide - vLLM version: release/v0.13.0 - vLLM main: vllm-project/vllm@bc0a5a0 --------- Signed-off-by: LookAround <lixushi@huawei.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

### What this PR does / why we need it? add long_sequence feature user guide - vLLM version: release/v0.13.0 - vLLM main: vllm-project/vllm@bc0a5a0 --------- Signed-off-by: LookAround <lixushi@huawei.com>

add long_sequence feature user guide

3292a03

Signed-off-by: LookAround <lixushi@huawei.com>

gemini-code-assist bot reviewed Dec 25, 2025

MengqingCao changed the title ~~add long_sequence feature user guide~~ [Doc] add long_sequence feature user guide Dec 25, 2025

github-actions bot added the documentation Improvements or additions to documentation label Dec 25, 2025

bug fix

aea687f

Signed-off-by: LookAround <lixushi@huawei.com>

LookAround0301 commented Dec 26, 2025

bug fix

a201ab5

Signed-off-by: LookAround <lixushi@huawei.com>

MengqingCao mentioned this pull request Dec 26, 2025

[Release]: Release checklist for v0.13.0rc1 #5229

Closed

46 tasks

LookAround0301 added 6 commits December 26, 2025 20:16

bug fix

a9c0c2f

Signed-off-by: LookAround <lixushi@huawei.com>

bug fix

dbaae44

Signed-off-by: LookAround <lixushi@huawei.com>

bug fix

fc68d27

Signed-off-by: LookAround <lixushi@huawei.com>

bug fix

b689e73

Signed-off-by: LookAround <lixushi@huawei.com>

Merge branch 'refs/heads/main' into user_guide

5de38be

bug fix

2ad9e95

Signed-off-by: LookAround <lixushi@huawei.com>

wangxiyuan merged commit ca31d68 into vllm-project:main Dec 27, 2025
10 checks passed

zhenwenqi2024 mentioned this pull request Dec 31, 2025

[RFC]: [Feature]: Context Parallelism && Sequence Parallelism #2329

Open

LookAround0301 deleted the user_guide branch January 4, 2026 06:33

wangxiyuan merged 9 commits into vllm-project:main from LookAround0301:user_guide
