docs/source/tutorials/models/DeepSeek-V3.2.md (22 changes: 15 additions & 7 deletions)
@@ -140,6 +140,7 @@ export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
@@ -157,6 +158,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+ --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'

```
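Once the server is up, a quick smoke test can confirm the endpoint responds. This is a hedged sketch: the `--port` value is collapsed in the hunk above, so it assumes vLLM's default port 8000, and it relies on the served model name defaulting to the model path.

```shell
# Smoke test against the OpenAI-compatible API that vllm serve exposes.
# Port 8000 is an assumption; substitute the --port actually used.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```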
@@ -197,6 +199,7 @@ export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
@@ -217,6 +220,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+ --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

@@ -244,6 +248,7 @@ export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export VLLM_ASCEND_ENABLE_MLAPO=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
+ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
@@ -266,6 +271,7 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
+ --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
```

@@ -301,7 +307,7 @@ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
-
+ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
**Contributor** (review comment, severity: high): The environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1` is already exported on line 306. This duplicate export is redundant and should be removed for clarity.
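A minimal sketch of the deduplicated block the comment asks for, keeping only the lines visible in this hunk (any collapsed exports are assumed unchanged):

```shell
# Deduplicated setup: VLLM_ASCEND_ENABLE_FLASHCOMM1 is exported exactly once.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
```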
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
@@ -321,8 +327,9 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
- --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]}' \
- --speculative-config '{"num_speculative_tokens": 2, "method": "deepseek_mtp"}'
+ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
+ --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
+ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'

```
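Since `--compilation-config`, `--additional-config`, and `--speculative-config` all take inline JSON, a stray quote fails only at startup. A quick local check (a generic sketch, not part of the PR; assumes `jq` is installed) catches malformed values before launching:

```shell
# jq exits non-zero and prints a parse error if any of these JSON strings is malformed.
echo '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' | jq .
echo '{"layer_sharding": ["q_b_proj", "o_proj"]}' | jq .
echo '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}' | jq .
```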

@@ -354,7 +361,7 @@ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
-
+ export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
**Contributor** (review comment, severity: high): The environment variable `VLLM_ASCEND_ENABLE_FLASHCOMM1` is already exported on line 360. This duplicate export is redundant and should be removed for clarity.
vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--host 0.0.0.0 \
@@ -376,8 +383,9 @@ vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.92 \
- --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[3,6,9,12,15,18,21,24,27,30,33,36,39,42,45,48]}' \
- --speculative-config '{"num_speculative_tokens": 2, "method": "deepseek_mtp"}'
+ --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[8, 16, 24, 32, 40, 48]}' \
+ --additional-config '{"layer_sharding": ["q_b_proj", "o_proj"]}' \
+ --speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'

```
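One plausible reading of why the capture sizes and `num_speculative_tokens` change together (an inference, not stated in the PR): with `deepseek_mtp` and k speculative tokens, each full-decode step handles k+1 tokens per sequence, so capture sizes are kept at multiples of k+1. The old list [3, 6, ..., 48] is multiples of 3 with k=2; the new list [8, 16, ..., 48] is multiples of 4 with k=3. A hedged sanity check of that assumption:

```shell
# Verifies the multiples-of-(k+1) reading described above; prints each size that fits.
k=3
for s in 8 16 24 32 40 48; do
  [ $((s % (k + 1))) -eq 0 ] && echo "$s is a multiple of $((k + 1))"
done
```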

@@ -832,7 +840,7 @@ python launch_online_dp.py --dp-size 8 --tp-size 4 --dp-size-local 4 --dp-rank-s

### Request Forwarding

- To set up request forwarding, run the following script on any machine :download:`load_balance_proxy_server_example.py <examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py>`
+ To set up request forwarding, run the following script on any machine. You can get the proxy program in the repository's examples: [load_balance_proxy_server_example.py](https://github.com/vllm-project/vllm-ascend/blob/main/examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py)

```shell
unset http_proxy
# ... (remainder of this hunk is collapsed in the diff view)
```
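Once the proxy is running, clients send requests to it instead of to any single vLLM instance. A hedged example (the listen port is hypothetical, not taken from the script, and this assumes the proxy forwards vLLM's OpenAI-compatible routes):

```shell
# PROXY_PORT is a placeholder; use the port the proxy script was started with.
PROXY_PORT=8000
curl -s http://127.0.0.1:${PROXY_PORT}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-W8A8",
        "prompt": "Hello",
        "max_tokens": 16
      }'
```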