Handle with_prefill_across_dp for multistream mla by sdmyzlp · Pull Request #1322 · vllm-project/vllm-ascend

sdmyzlp · 2025-06-20T10:02:25Z

What this PR does / why we need it?

After #1094, decode might be executed in non-compiled mode even when torchair_graph_config.enabled is set. This causes multistream MLA to fail, because it assumes that decode runs in torchair compiled mode whenever torchair_graph_config.enabled == True.
This PR augments that assumption to fix the failure.
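A minimal sketch of the kind of guard described above, not the actual patch. The flag name `with_prefill_across_dp` comes from the PR title; `FakeAttnMetadata`, `use_multistream_mla`, and the reading that this flag marks a step where decode falls back to non-compiled mode are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeAttnMetadata:
    """Hypothetical stand-in for the real attention metadata object."""
    # Assumed meaning: this step cannot use the compiled decode graph.
    with_prefill_across_dp: bool = False


def use_multistream_mla(config_enabled: bool,
                        attn_metadata: Optional[FakeAttnMetadata]) -> bool:
    """Return True only when the multistream path is assumed safe.

    The config switch alone is not enough: the per-step metadata must
    also indicate that decode is running in compiled mode.
    """
    return (config_enabled
            and attn_metadata is not None
            and not attn_metadata.with_prefill_across_dp)


if __name__ == "__main__":
    print(use_multistream_mla(True, FakeAttnMetadata(False)))
    print(use_multistream_mla(True, FakeAttnMetadata(True)))
```

The point of the sketch is that the decision is made per forward call from the metadata, rather than once from the static config.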

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tested offline and with the graph-mode MLA e2e test case.

ApsarasX · 2025-06-21T04:28:07Z

LGTM

codecov · 2025-06-23T08:26:22Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.22%. Comparing base (c30ddb8) to head (d4acec1).
⚠️ Report is 550 commits behind head on main.

Files with missing lines	Patch %	Lines
vllm_ascend/models/deepseek_v2.py	0.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1322      +/-   ##
==========================================
- Coverage   27.39%   27.22%   -0.17%     
==========================================
  Files          56       56              
  Lines        6191     6214      +23     
==========================================
- Hits         1696     1692       -4     
- Misses       4495     4522      +27

Flag	Coverage Δ
unittests	`27.22% <0.00%> (-0.17%)`	⬇️


github-actions · 2025-06-25T03:17:09Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

Yikun

Some comments inline; feel free to open a new PR to address them.

Yikun · 2025-06-25T23:41:37Z

vllm_ascend/models/deepseek_v2.py

            hidden_states: torch.Tensor,
            kv_cache: Optional[torch.Tensor] = None,
            attn_metadata: Optional[AttentionMetadata] = None) -> torch.Tensor:
+        enable_multistream_mla = (self.enable_multistream_mla


nit: we should log here to tell users why enable_multistream_mla is not enabled in this case, and add a code comment explaining it.

Yikun · 2025-06-25T23:43:25Z

cc @ganyi1996ppo pls


### What this PR does / why we need it?

> Need to merge after PR #1322

According to benchmark results, this PR brings approximately 1% performance gain.

#### Before Improvement

Profiling

<img width="1147" alt="Screenshot 2025-06-22 14 54 47" src="https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c" />

Evaluation

```
# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
    --random-input-len 4096 \
    --random-output-len 1536 \
    --num-prompts 200 \
    --ignore-eos \
    --model auto \
    --tokenizer /DeepSeek-R1-W8A8 \
    --port 8006 \
    --request-rate 1 \
    --max-concurrency 24 \
    --save-result \
    --skip-initial-test \
    --metric-percentiles "50,90,99"
```

```
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  958.59
Total input tokens:                      819200
Total generated tokens:                  307200
Request throughput (req/s):              0.2086
Output token throughput (tok/s):         320.47
Total Token throughput (tok/s):          1175.05
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70
Median TTFT (ms):                        713.87
P50 TTFT (ms):                           713.87
P90 TTFT (ms):                           1363.88
P99 TTFT (ms):                           2008.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96
Median TPOT (ms):                        69.49
P50 TPOT (ms):                           69.49
P90 TPOT (ms):                           70.42
P99 TPOT (ms):                           70.72
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96
Median ITL (ms):                         59.88
P50 ITL (ms):                            59.88
P90 ITL (ms):                            61.59
P99 ITL (ms):                            68.82
==================================================
```

#### After Improvement

Profiling

<img width="1200" alt="Screenshot 2025-06-22 14 55 42" src="https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f" />

Evaluation

```
============ Serving Benchmark Result ============
Successful requests:                     200
Benchmark duration (s):                  948.08
Total input tokens:                      819200
Total generated tokens:                  307200
Request throughput (req/s):              0.2110
Output token throughput (tok/s):         324.02
Total Token throughput (tok/s):          1188.08
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25
Median TTFT (ms):                        714.63
P50 TTFT (ms):                           714.63
P90 TTFT (ms):                           1367.31
P99 TTFT (ms):                           2661.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14
Median TPOT (ms):                        68.68
P50 TPOT (ms):                           68.68
P90 TPOT (ms):                           69.33
P99 TPOT (ms):                           70.30
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14
Median ITL (ms):                         59.04
P50 ITL (ms):                            59.04
P90 ITL (ms):                            60.93
P99 ITL (ms):                            66.89
==================================================
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@65393ee

Signed-off-by: ApsarasX <apsarax@outlook.com>
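The "approximately 1%" figure can be checked directly from the two benchmark reports above. The short script below is only an illustration of that arithmetic; the helper name `relative_change` is made up here, and the numbers are copied from the before and after results.

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the two benchmark reports.
metrics = {
    "Benchmark duration (s)": (958.59, 948.08),
    "Output token throughput (tok/s)": (320.47, 324.02),
    "Mean TTFT (ms)": (942.70, 1019.25),
    "Mean TPOT (ms)": (68.96, 68.14),
    "Mean ITL (ms)": (68.96, 68.14),
}

if __name__ == "__main__":
    for name, (before, after) in metrics.items():
        print(f"{name}: {relative_change(before, after):+.2f}%")
```

Output token throughput rises by about 1.1% and mean TPOT falls by about 1.2%, which matches the stated gain. Mean TTFT is about 8% higher in the "after" run, while median TTFT is nearly unchanged; a single run does not show whether that is noise.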


ApsarasX approved these changes Jun 21, 2025


ApsarasX mentioned this pull request Jun 22, 2025

[Perf] Improve MLA multistream performance #1353

Merged

sdmyzlp force-pushed the br_handle_prefill_with_dp branch from 62583c9 to 9eab500 Compare June 23, 2025 07:59

sdmyzlp force-pushed the br_handle_prefill_with_dp branch from 9eab500 to da07c7f Compare June 24, 2025 23:40

github-actions bot added the module:tests label Jun 24, 2025

sdmyzlp force-pushed the br_handle_prefill_with_dp branch 2 times, most recently from 5e4c102 to 46d5591 Compare June 25, 2025 02:24

github-actions bot added the merge-conflicts label Jun 25, 2025

sdmyzlp force-pushed the br_handle_prefill_with_dp branch from 46d5591 to ccff3c0 Compare June 25, 2025 03:52

github-actions bot removed the merge-conflicts label Jun 25, 2025

wangxiyuan approved these changes Jun 25, 2025


sdmyzlp force-pushed the br_handle_prefill_with_dp branch 2 times, most recently from de8d84f to 3fbeefe Compare June 25, 2025 11:24

sdmyzlp added 2 commits June 25, 2025 19:45

Handle with_prefill_across_dp for multistream mla

960ce13

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

Add e2e test for multistream mla

d4acec1

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

sdmyzlp force-pushed the br_handle_prefill_with_dp branch from 3fbeefe to d4acec1 Compare June 25, 2025 11:46

Yikun approved these changes Jun 25, 2025


Yikun merged commit 53c2d58 into vllm-project:main Jun 26, 2025
24 checks passed

Yikun added the long-term-test (enable long term test for PR) and ready-for-test (start test by label for PR) labels Jun 26, 2025

Yikun mentioned this pull request Jun 29, 2025

[0.9.1][cherry-pick] Backport multistream MLA fixes and TP communication optimizations #1474

Closed

Yikun mentioned this pull request Sep 20, 2025

[Bug]: Remove outofdate commits to improve perf test #3051

Closed
