From 3292a0307333106e363d2d00cc4b971f6be5d981 Mon Sep 17 00:00:00 2001
From: LookAround
Date: Thu, 25 Dec 2025 10:42:55 +0800
Subject: [PATCH 1/8] add long_sequence feature user guide

Signed-off-by: LookAround
---
 ...g_sequence_context_parallel_single_node.md | 180 ++++++++++++++++++
 1 file changed, 180 insertions(+)
 create mode 100644 docs/source/tutorials/long_sequence_context_parallel_single_node.md

diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
new file mode 100644
index 00000000000..dbc6f6d6071
--- /dev/null
+++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
@@ -0,0 +1,180 @@
# Long-Sequence Context Parallel (Qwen3-235B-A22B)

## Getting Started

vLLM-Ascend now supports long-sequence context parallel. This guide walks through the steps to verify the feature with constrained resources.

Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, this guide uses vllm-ascend:0.12.0rc2 (with vLLM v0.13.0) on 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "long sequence" architecture.

## Environment Preparation

### Model Weight

- `Qwen3-235B-A22B-w8a8` (quantized version): requires 1 Atlas 800 A3 (64G × 16) node. [Download model weight](https://modelscope.cn/models/vllm-ascend/Qwen3-235B-A22B-W8A8)

It is recommended to download the model weight to a directory shared by all nodes, such as `/root/.cache/`.

### Installation

:::::{tab-set}
::::{tab-item} Use docker image

For example, use the image `quay.io/ascend/vllm-ascend:v0.12.0rc2` (for Atlas 800 A3).

Select an image based on your machine type and start the container on your node; refer to [using docker](../installation.md#set-up-using-docker).

```{code-block} bash
   :substitutions:

   # Update --device according to your device (Atlas A2: /dev/davinci[0-7], Atlas A3: /dev/davinci[0-15]).
   # Update the vllm-ascend image according to your environment.
   # Note: you should download the weight to /root/.cache in advance.
   export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
   export NAME=vllm-ascend

   # Run the container using the defined variables.
   # Note: if you are running a bridge network with docker, please expose the ports needed for multi-node communication in advance.
   docker run --rm \
   --name $NAME \
   --net=host \
   --shm-size=1g \
   --device /dev/davinci0 \
   --device /dev/davinci1 \
   --device /dev/davinci2 \
   --device /dev/davinci3 \
   --device /dev/davinci4 \
   --device /dev/davinci5 \
   --device /dev/davinci6 \
   --device /dev/davinci7 \
   --device /dev/davinci_manager \
   --device /dev/devmm_svm \
   --device /dev/hisi_hdc \
   -v /usr/local/dcmi:/usr/local/dcmi \
   -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
   -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
   -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
   -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
   -v /etc/ascend_install.info:/etc/ascend_install.info \
   -it $IMAGE bash
```

::::
::::{tab-item} Build from source

You can also build everything from source.

- To install `vllm-ascend`, refer to [set up using python](../installation.md#set-up-using-python).

::::
:::::

If you want to deploy a multi-node environment, you need to set up the environment on each node.
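Before moving on to deployment, make sure the weight is available locally. As one option, the sketch below pre-downloads it with the ModelScope CLI. This is a minimal sketch, not the only supported path: it assumes the `modelscope` package is available in your environment (install it with pip if not), and the model ID is taken from the download link above. The target directory is an illustrative choice under the recommended `/root/.cache/`.

```shell
# Assumption: the modelscope CLI may not ship with the image; install it first
pip install modelscope

# Pre-download the quantized weight into the shared cache directory
modelscope download --model vllm-ascend/Qwen3-235B-A22B-W8A8 --local_dir /root/.cache/Qwen3-235B-A22B-W8A8
```

Alternatively, keep `export VLLM_USE_MODELSCOPE=true` in the serve script below and let vLLM fetch the weight on first start; pre-downloading simply avoids paying that cost during deployment.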
## Deployment

### Single-node Deployment

`Qwen3-235B-A22B-w8a8` can be deployed on 1 Atlas 800 A3 (64G × 16).
The quantized version needs to be started with the parameter `--quantization ascend`.

Run the following script to start online inference with a 128k context.

```shell
#!/bin/sh
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=true
# To reduce memory fragmentation and avoid out-of-memory errors
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_BUFFSIZE=512
export HCCL_OP_EXPANSION_MODE="AIV"
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export TASK_QUEUE_ENABLE=1

vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--prefill-context-parallel-size 2 \
--decode-context-parallel-size 2 \
--seed 1024 \
--quantization ascend \
--served-model-name qwen3 \
--max-num-seqs 1 \
--max-model-len 133000 \
--max-num-batched-tokens 133000 \
--enable-expert-parallel \
--trust-remote-code \
--gpu-memory-utilization 0.95 \
--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--async-scheduling
```

**Notice:**
- For vLLM versions below `v0.12.0`, use the parameter `--rope_scaling '{"rope_type":"yarn","factor":4,"original_max_position_embeddings":32768}'`.
- For vLLM `v0.12.0` and later, use the parameter `--hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}'`.

The parameters are explained as follows:
- `--tensor-parallel-size 8` is a common setting for the tensor parallelism (TP) size.
- `--prefill-context-parallel-size 2` is a common setting for the prefill context parallelism (PCP) size.
- `--decode-context-parallel-size 2` is a common setting for the decode context parallelism (DCP) size.
- `--max-model-len` represents the context length, which is the maximum value of the input plus output for a single request.
- `--max-num-seqs` indicates the maximum number of requests that each DP group is allowed to process. If the number of requests sent to the service exceeds this limit, the excess requests will remain in a waiting state and will not be scheduled. Note that the time spent in the waiting state is also counted in metrics such as TTFT and TPOT. Therefore, when testing performance, it is generally recommended that `--max-num-seqs` * `--data-parallel-size` >= the actual total concurrency.
- `--max-num-batched-tokens` represents the maximum number of tokens that the model can process in a single step. Currently, vLLM v1 scheduling enables chunked prefill (SplitFuse) by default, which means:
  - (1) If the input length of a request is greater than `--max-num-batched-tokens`, it will be divided into multiple rounds of computation according to `--max-num-batched-tokens`;
  - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
  - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation usage) will be greater.
- `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to determine the available kv_cache size. During the warm-up phase (referred to as the profile run in vLLM), vLLM records the peak GPU memory usage during an inference pass with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (out-of-memory) issues during actual inference. The default value is `0.9`.
- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can use either pure EP or pure TP.
- `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option. (The example script above does not set it.)
- `--quantization ascend` indicates that quantization is used. To disable quantization, remove this option.
- `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are `"cudagraph_mode"` and `"cudagraph_capture_sizes"`, which have the following meanings:
  - `"cudagraph_mode"`: represents the specific graph mode. Currently, `"PIECEWISE"` and `"FULL_DECODE_ONLY"` are supported. The graph mode is mainly used to reduce the cost of operator dispatch; `"FULL_DECODE_ONLY"` is currently recommended.
  - `"cudagraph_capture_sizes"`: represents the capture levels of the graph mode. The default value is [1, 2, 4, 8, 16, 24, 32, 40, ..., `--max-num-seqs`]. In graph mode, the input size for each captured graph is fixed, and inputs between levels are automatically padded up to the next level. The default setting is recommended; only in some scenarios is it necessary to set this separately to achieve optimal performance.
- `export VLLM_ASCEND_ENABLE_FLASHCOMM1=1` indicates that FlashComm1 optimization is enabled. Currently, this optimization is only supported for MoE in scenarios where tp_size > 1.

**Notice:**
- tp_size needs to be divisible by dcp_size.
- The decode context parallel size must be less than or equal to max_dcp_size, where max_dcp_size = tensor_parallel_size // total_num_kv_heads. For example, Qwen3-235B-A22B has 4 KV heads, so max_dcp_size = 8 // 4 = 2, which matches `--decode-context-parallel-size 2` above.
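Once the server is up, you can sanity-check the endpoint with a single request before running the full evaluations below. This is a minimal sketch against vLLM's OpenAI-compatible API; the model name `qwen3` matches `--served-model-name` above, and the host and port match the serve command (adjust them if you changed either).

```shell
# Send one small chat completion request to the local server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3",
        "messages": [{"role": "user", "content": "Summarize context parallelism in one sentence."}],
        "max_tokens": 64
      }'
```

If the server returns a JSON completion, the deployment is healthy and you can proceed to accuracy and performance evaluation.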
## Accuracy Evaluation

Accuracy can be evaluated with AISBench as follows.

### Using AISBench

1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details.

2. After execution, you can get the result. Here is the result of `Qwen3-235B-A22B-w8a8` on `vllm-ascend:0.12.0rc2`, for reference only.

| dataset  | version | metric   | mode | vllm-api-general-chat |
|----------|---------|----------|------|-----------------------|
| aime2024 | -       | accuracy | gen  | 83.33                 |

## Performance

### Using AISBench

Refer to [Using AISBench for performance evaluation](../developer_guide/evaluation/using_ais_bench.md#execute-performance-evaluation) for details.

### Using vLLM Benchmark

Run the performance evaluation of `Qwen3-235B-A22B-w8a8` as an example.

Refer to [vllm benchmark](https://docs.vllm.ai/en/latest/contributing/benchmarks.html) for more details.

There are three `vllm bench` subcommands:
- `latency`: Benchmark the latency of a single batch of requests.
- `serve`: Benchmark the online serving throughput.
- `throughput`: Benchmark offline inference throughput.

Take `serve` as an example. Run the code as follows.
```shell
export VLLM_USE_MODELSCOPE=true
vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
```

After a few minutes, you will get the performance evaluation result.

From aea687fd260a7df3b5511fdb5c1031e1d6d14e68 Mon Sep 17 00:00:00 2001
From: LookAround
Date: Thu, 25 Dec 2025 20:01:17 +0800
Subject: [PATCH 2/8] bug fix

Signed-off-by: LookAround
---
 docs/source/tutorials/index.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md
index f71aa5260b5..6e2d1db90e9 100644
--- a/docs/source/tutorials/index.md
+++ b/docs/source/tutorials/index.md
@@ -20,6 +20,7 @@ DeepSeek-V3.1.md
 DeepSeek-V3.2.md
 DeepSeek-R1.md
 Kimi-K2-Thinking
+long_sequence_context_parallel_single_node
 pd_disaggregation_mooncake_single_node
 pd_disaggregation_mooncake_multi_node
 ray

From a201ab561c07209bde1897e23de60ea3c91efeae Mon Sep 17 00:00:00 2001
From: LookAround
Date: Fri, 26 Dec 2025 10:13:43 +0800
Subject: [PATCH 3/8] bug fix

Signed-off-by: LookAround
---
 .../tutorials/long_sequence_context_parallel_single_node.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
index dbc6f6d6071..8ded6625f37 100644
--- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md
+++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
@@ -4,7 +4,7 @@
 vLLM-Ascend now supports long-sequence context parallel. This guide walks through the steps to verify the feature with constrained resources.
-Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, this guide uses vllm-ascend:0.12.0rc2 (with vLLM v0.13.0) on 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "long sequence" architecture.
+Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, this guide uses vllm-ascend:0.12.0rc2 (with vLLM v0.13.0) on 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "pd co-locate" architecture.

From a9c0c2ffb0133ef3a60f483cfffb94677472db23 Mon Sep 17 00:00:00 2001
From: LookAround
Date: Fri, 26 Dec 2025 20:16:39 +0800
Subject: [PATCH 4/8] bug fix

Signed-off-by: LookAround
---
 ...g_sequence_context_parallel_single_node.md | 136 +++++++++---------
 1 file changed, 65 insertions(+), 71 deletions(-)

diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
index 8ded6625f37..9288c5ee11d 100644
--- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md
+++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md
@@ -4,7 +4,7 @@
 vLLM-Ascend now supports long-sequence context parallel. This guide walks through the steps to verify the feature with constrained resources.
-Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, this guide uses vllm-ascend:0.12.0rc2 (with vLLM v0.13.0) on 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "pd co-locate" architecture.
+Using the `Qwen3-235B-A22B-w8a8` (quantized version) model as an example, this guide uses vllm-ascend:|vllm_ascend_version| on 1 Atlas 800 A3 (64G × 16) server to deploy the single-node "pd co-locate" architecture.
## Environment Preparation @@ -14,62 +14,51 @@ Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use vll It is recommended to download the model weight to the shared directory of multiple nodes, such as `/root/.cache/` -### Installation - -:::::{tab-set} -::::{tab-item} Use docker image - -For example, using images `quay.io/ascend/vllm-ascend:v0.12.0rc2`(for Atlas 800 A3). - -Select an image based on your machine type and start the docker image on your node, refer to [using docker](../installation.md#set-up-using-docker). +### Run with Docker +Start a Docker container on each node. ```{code-block} bash - :substitutions: - # Update --device according to your device (Atlas A2: /dev/davinci[0-7] Atlas A3:/dev/davinci[0-15]). - # Update the vllm-ascend image according to your environment. - # Note you should download the weight to /root/.cache in advance. - # Update the vllm-ascend image - export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| - export NAME=vllm-ascend - - # Run the container using the defined variables - # Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance - docker run --rm \ - --name $NAME \ - --net=host \ - --shm-size=1g \ - --device /dev/davinci0 \ - --device /dev/davinci1 \ - --device /dev/davinci2 \ - --device /dev/davinci3 \ - --device /dev/davinci4 \ - --device /dev/davinci5 \ - --device /dev/davinci6 \ - --device /dev/davinci7 \ - --device /dev/davinci_manager \ - --device /dev/devmm_svm \ - --device /dev/hisi_hdc \ - -v /usr/local/dcmi:/usr/local/dcmi \ - -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ - -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ - -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ - -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ - -v /etc/ascend_install.info:/etc/ascend_install.info \ - -it $IMAGE bash + :substitutions: +# Update the vllm-ascend image +export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version| +export NAME=vllm-ascend + +# Run the container using the defined variables +# Note: If you are running bridge network with docker, please expose available ports for multiple nodes communication in advance +docker run --rm \ +--name $NAME \ +--net=host \ +--shm-size=1g \ +--device /dev/davinci0 \ +--device /dev/davinci1 \ +--device /dev/davinci2 \ +--device /dev/davinci3 \ +--device /dev/davinci4 \ +--device /dev/davinci5 \ +--device /dev/davinci6 \ +--device /dev/davinci7 \ +--device /dev/davinci8 \ +--device /dev/davinci9 \ +--device /dev/davinci10 \ +--device /dev/davinci11 \ +--device /dev/davinci12 \ +--device /dev/davinci13 \ +--device /dev/davinci14 \ +--device /dev/davinci15 \ +--device /dev/davinci_manager \ +--device /dev/devmm_svm \ +--device /dev/hisi_hdc \ +-v /usr/local/dcmi:/usr/local/dcmi \ +-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \ +-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \ +-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \ +-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \ +-v /etc/ascend_install.info:/etc/ascend_install.info \ +-v /etc/hccn.conf:/etc/hccn.conf \ +-v /mnt/sfs_turbo/.cache:/root/.cache \ +-it $IMAGE bash ``` -:::: -::::{tab-item} Build from source - -You can build all from source. 
- -- Install `vllm-ascend`, refer to [set up using python](../installation.md#set-up-using-python). - -:::: -::::: - -If you want to deploy multi-node environment, you need to set up environment on each node. - ## Deployment ### Single-node Deployment @@ -93,23 +82,23 @@ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 export TASK_QUEUE_ENABLE=1 vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ ---host 0.0.0.0 \ ---port 8000 \ ---tensor-parallel-size 8 \ ---prefill-context-parallel-size 2 \ ---decode-context-parallel-size 2 \ ---seed 1024 \ ---quantization ascend \ ---served-model-name qwen3 \ ---max-num-seqs 1 \ ---max-model-len 133000 \ ---max-num-batched-tokens 133000 \ ---enable-expert-parallel \ ---trust-remote-code \ ---gpu-memory-utilization 0.95 \ ---hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \ ---compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ ---async-scheduling + --host 0.0.0.0 \ + --port 8000 \ + --tensor-parallel-size 8 \ + --prefill-context-parallel-size 2 \ + --decode-context-parallel-size 2 \ + --seed 1024 \ + --quantization ascend \ + --served-model-name qwen3 \ + --max-num-seqs 1 \ + --max-model-len 133000 \ + --max-num-batched-tokens 133000 \ + --enable-expert-parallel \ + --trust-remote-code \ + --gpu-memory-utilization 0.95 \ + --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ + --async-scheduling ``` **Notice:** @@ -147,7 +136,7 @@ Here are two accuracy evaluation methods. 1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. -2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in `vllm-ascend:0.12.0rc2` for reference only. +2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in vllm-ascend:|vllm_ascend_version| for reference only. | dataset | version | metric | mode | vllm-api-general-chat | |----------| ----- | ----- | ----- |-----------------------| @@ -178,3 +167,8 @@ vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random ``` After about several minutes, you can get the performance evaluation result. + + +| dataset | version | metric | mode | vllm-api-stream-chat | +|---------| ----- |-------------|------|----------------------| +| random | - | performance | perf | 17.36 | \ No newline at end of file From dbaae449a51a08b0ddfe5a2646fc11fa0de559eb Mon Sep 17 00:00:00 2001 From: LookAround Date: Fri, 26 Dec 2025 20:26:34 +0800 Subject: [PATCH 5/8] bug fix Signed-off-by: LookAround --- .../long_sequence_context_parallel_single_node.md | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md index 9288c5ee11d..90993cd2499 100644 --- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md @@ -4,7 +4,7 @@ vLLM-Ascend now supports long-sequence context parallel. This guide takes one-by-one steps to verify these features with constrained resources. -Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use vllm-ascend:|vllm_ascend_version| 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture. 
+Using the `Qwen3-235B-A22B-w8a8`(Quantized version) model as an example, use 1 Atlas 800 A3 (64G × 16) server to deploy the single node "pd co-locate" architecture. ## Environment Preparation @@ -80,6 +80,7 @@ export OMP_PROC_BIND=false export OMP_NUM_THREADS=1 export VLLM_ASCEND_ENABLE_FLASHCOMM1=1 export TASK_QUEUE_ENABLE=1 +export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ --host 0.0.0.0 \ @@ -97,7 +98,7 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ --trust-remote-code \ --gpu-memory-utilization 0.95 \ --hf-overrides '{"rope_parameters": {"rope_type":"yarn","rope_theta":1000000,"factor":4,"original_max_position_embeddings":32768}}' \ - --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \ + --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8]}' \ --async-scheduling ``` @@ -136,7 +137,7 @@ Here are two accuracy evaluation methods. 1. Refer to [Using AISBench](../developer_guide/evaluation/using_ais_bench.md) for details. -2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` in vllm-ascend:|vllm_ascend_version| for reference only. +2. After execution, you can get the result, here is the result of `Qwen3-235B-A22B-w8a8` for reference only. | dataset | version | metric | mode | vllm-api-general-chat | |----------| ----- | ----- | ----- |-----------------------| @@ -163,12 +164,12 @@ Take the `serve` as an example. Run the code as follows. ```shell export VLLM_USE_MODELSCOPE=true -vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./ +vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random --random-input 131072 --num-prompt 1 --request-rate 1 --save-result --result-dir ./ ``` After about several minutes, you can get the performance evaluation result. 
-| dataset | version | metric | mode | vllm-api-stream-chat | -|---------| ----- |-------------|------|----------------------| -| random | - | performance | perf | 17.36 | \ No newline at end of file +| dataset | version | metric | mode | ttft | +|---------| ----- |-------------|------|--------| +| random | - | performance | perf | 17.36s | \ No newline at end of file From fc68d2719233b98acbcdf23c2d608808599a9051 Mon Sep 17 00:00:00 2001 From: LookAround Date: Fri, 26 Dec 2025 21:22:45 +0800 Subject: [PATCH 6/8] bug fix Signed-off-by: LookAround --- .../tutorials/long_sequence_context_parallel_single_node.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md index 90993cd2499..a4d335979b6 100644 --- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md @@ -92,8 +92,8 @@ vllm serve vllm-ascend/Qwen3-235B-A22B-w8a8 \ --quantization ascend \ --served-model-name qwen3 \ --max-num-seqs 1 \ - --max-model-len 133000 \ - --max-num-batched-tokens 133000 \ + --max-model-len 133008 \ + --max-num-batched-tokens 133008 \ --enable-expert-parallel \ --trust-remote-code \ --gpu-memory-utilization 0.95 \ From b689e73a2e7cb1b5e5c88baa96eeab74fe22d287 Mon Sep 17 00:00:00 2001 From: LookAround Date: Sat, 27 Dec 2025 09:59:17 +0800 Subject: [PATCH 7/8] bug fix Signed-off-by: LookAround --- .../tutorials/long_sequence_context_parallel_single_node.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/docs/source/tutorials/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/long_sequence_context_parallel_single_node.md index a4d335979b6..e09f98b4f5a 100644 --- a/docs/source/tutorials/long_sequence_context_parallel_single_node.md +++ b/docs/source/tutorials/long_sequence_context_parallel_single_node.md @@ -169,7 +169,6 @@ vllm bench serve --model vllm-ascend/Qwen3-235B-A22B-w8a8 --dataset-name random After about several minutes, you can get the performance evaluation result. - | dataset | version | metric | mode | ttft | |---------| ----- |-------------|------|--------| -| random | - | performance | perf | 17.36s | \ No newline at end of file +| random | - | performance | perf | 17.36s | From 2ad9e95dc1a054c1aed0e7aa15d74f30e2136476 Mon Sep 17 00:00:00 2001 From: LookAround Date: Sat, 27 Dec 2025 10:09:23 +0800 Subject: [PATCH 8/8] bug fix Signed-off-by: LookAround --- docs/source/tutorials/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/tutorials/index.md b/docs/source/tutorials/index.md index b87758507f7..5f9242091f4 100644 --- a/docs/source/tutorials/index.md +++ b/docs/source/tutorials/index.md @@ -20,9 +20,9 @@ DeepSeek-V3.1.md DeepSeek-V3.2.md DeepSeek-R1.md Kimi-K2-Thinking -long_sequence_context_parallel_single_node pd_disaggregation_mooncake_single_node pd_disaggregation_mooncake_multi_node +long_sequence_context_parallel_single_node long_sequence_context_parallel_multi_node ray 310p