[Perf] Improve MLA multistream performance by ApsarasX · Pull Request #1353 · vllm-project/vllm-ascend

ApsarasX · 2025-06-22T07:10:14Z

What this PR does / why we need it?

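This PR needs to be merged after PR #1322.

According to the benchmark results below, this PR brings a performance gain of approximately 1%: output token throughput rises from 320.47 to 324.02 tok/s, and mean TPOT drops from 68.96 ms to 68.14 ms.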

Before Improvement

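Profiling (Screenshot 2025-06-22 14 54 47): https://github.com/user-attachments/assets/4a4dc7f1-5b76-45d5-864d-dd7f8faf993c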

Evaluation

# server launch command
python -m vllm.entrypoints.openai.api_server --model=/DeepSeek-R1-W8A8 \
    --quantization ascend \
    --served-model-name auto \
    --trust-remote-code \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=16 \
    --max-num-seqs 24 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --block-size 128 \
    --no-enable-prefix-caching \
    --additional-config '{"torchair_graph_config":{"enable_multistream_mla": true,"enabled":true,"use_cached_graph":true,"graph_batch_sizes":[24]},"ascend_scheduler_config":{"enabled":true},"expert_tensor_parallel_size":16}' \
    --gpu-memory-utilization 0.96

# client benchmark command
python /root/vllm/benchmarks/benchmark_serving.py --backend vllm --dataset-name random \
        --random-input-len 4096 \
        --random-output-len 1536 \
        --num-prompts 200 \
        --ignore-eos \
        --model auto \
        --tokenizer /DeepSeek-R1-W8A8 \
        --port 8006 \
        --request-rate 1 \
        --max-concurrency 24 \
        --save-result \
        --skip-initial-test \
        --metric-percentiles "50,90,99"
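The `--additional-config` value in the server command is a dense single-line JSON string. As a reading aid, here is a minimal sketch that builds the same string from a Python dictionary. The keys and values are copied from the command above; the variable names `additional_config` and `config_arg` are only illustrative.

```python
import json

# Keys and values copied from the server launch command above.
additional_config = {
    "torchair_graph_config": {
        "enable_multistream_mla": True,  # the feature this PR optimizes
        "enabled": True,
        "use_cached_graph": True,
        "graph_batch_sizes": [24],       # same value as --max-num-seqs in the command
    },
    "ascend_scheduler_config": {"enabled": True},
    "expert_tensor_parallel_size": 16,
}

# Compact separators keep the value on one line for the shell.
config_arg = json.dumps(additional_config, separators=(",", ":"))
print(f"--additional-config '{config_arg}'")
```

Building the string with `json.dumps` avoids quoting mistakes when toggling a single flag such as `enable_multistream_mla` between runs.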

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  958.59    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2086    
Output token throughput (tok/s):         320.47    
Total Token throughput (tok/s):          1175.05   
---------------Time to First Token----------------
Mean TTFT (ms):                          942.70    
Median TTFT (ms):                        713.87    
P50 TTFT (ms):                           713.87    
P90 TTFT (ms):                           1363.88   
P99 TTFT (ms):                           2008.73   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.96     
Median TPOT (ms):                        69.49     
P50 TPOT (ms):                           69.49     
P90 TPOT (ms):                           70.42     
P99 TPOT (ms):                           70.72     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.96     
Median ITL (ms):                         59.88     
P50 ITL (ms):                            59.88     
P90 ITL (ms):                            61.59     
P99 ITL (ms):                            68.82     
==================================================

After Improvement

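Profiling (Screenshot 2025-06-22 14 55 42): https://github.com/user-attachments/assets/e3eb9dec-0ff0-4e5f-ab94-93c65003e51f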

Evaluation

============ Serving Benchmark Result ============
Successful requests:                     200       
Benchmark duration (s):                  948.08    
Total input tokens:                      819200    
Total generated tokens:                  307200    
Request throughput (req/s):              0.2110    
Output token throughput (tok/s):         324.02    
Total Token throughput (tok/s):          1188.08   
---------------Time to First Token----------------
Mean TTFT (ms):                          1019.25   
Median TTFT (ms):                        714.63    
P50 TTFT (ms):                           714.63    
P90 TTFT (ms):                           1367.31   
P99 TTFT (ms):                           2661.52   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.14     
Median TPOT (ms):                        68.68     
P50 TPOT (ms):                           68.68     
P90 TPOT (ms):                           69.33     
P99 TPOT (ms):                           70.30     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.14     
Median ITL (ms):                         59.04     
P50 ITL (ms):                            59.04     
P90 ITL (ms):                            60.93     
P99 ITL (ms):                            66.89     
==================================================
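To put the "approximately 1%" figure in context, the following sketch computes the relative change for a few metrics taken from the two result tables above. The numbers are copied from the tables; the helper `pct_change` and the choice of metrics are illustrative and not part of the PR.

```python
# Selected metrics copied from the "Before Improvement" and
# "After Improvement" benchmark tables.
before = {
    "Benchmark duration (s)": 958.59,
    "Output token throughput (tok/s)": 320.47,
    "Mean TTFT (ms)": 942.70,
    "P99 TTFT (ms)": 2008.73,
    "Mean TPOT (ms)": 68.96,
    "Median ITL (ms)": 59.88,
}
after = {
    "Benchmark duration (s)": 948.08,
    "Output token throughput (tok/s)": 324.02,
    "Mean TTFT (ms)": 1019.25,
    "P99 TTFT (ms)": 2661.52,
    "Mean TPOT (ms)": 68.14,
    "Median ITL (ms)": 59.04,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(before[name], after[name]) for name in before}

for name, delta in changes.items():
    print(f"{name:34s} {before[name]:9.2f} -> {after[name]:9.2f}  ({delta:+.2f}%)")
```

Throughput, TPOT, and ITL each move by roughly 1% in the favorable direction, while mean and P99 TTFT are higher in the "after" run. The PR reports one run per configuration and does not say whether the TTFT difference reflects run-to-run variance.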

Does this PR introduce any user-facing change?

No

How was this patch tested?

vLLM version: v0.9.2
vLLM main: vllm-project/vllm@65393ee

codecov · 2025-06-22T07:29:24Z

Codecov Report

❌ Patch coverage is 18.18182% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.52%. Comparing base (c30ddb8) to head (891bd87).
⚠️ Report is 613 commits behind head on main.

Files with missing lines	Patch %	Lines
vllm_ascend/attention/mla_v1.py	18.18%	9 Missing ⚠️
vllm_ascend/utils.py	14.28%	6 Missing ⚠️
vllm_ascend/models/deepseek_v2.py	25.00%	3 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1353       +/-   ##
===========================================
+ Coverage   27.39%   54.52%   +27.12%     
===========================================
  Files          56       80       +24     
  Lines        6191     9979     +3788     
===========================================
+ Hits         1696     5441     +3745     
- Misses       4495     4538       +43
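The percentages in the coverage diff can be rechecked from the hit and line counts. The sketch below does this. It assumes, judging only from the figures in this report, that the percentages are truncated to two decimals rather than rounded; the helper names are illustrative.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit."""
    return 100.0 * hits / lines


def truncate2(value: float) -> float:
    """Truncate to two decimals (assumed to match how the report displays values)."""
    return math.floor(value * 100) / 100


# Counts copied from the coverage diff above.
base_hits, base_lines = 1696, 6191   # main
head_hits, head_lines = 5441, 9979   # #1353

base = coverage_pct(base_hits, base_lines)
head = coverage_pct(head_hits, head_lines)

print(f"base:  {truncate2(base):.2f}%  (misses: {base_lines - base_hits})")
print(f"head:  {truncate2(head):.2f}%  (misses: {head_lines - head_hits})")
print(f"delta: {truncate2(head - base):+.2f}%")
```

Most of the +27.12% difference comes from comparing against a base that is far behind `main` (the report notes it is 613 commits behind), not from the 22 changed lines in this patch.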

Flag	Coverage Δ
unittests	`54.52% <18.18%> (+27.12%)`	⬆️


github-actions · 2025-06-25T12:13:08Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.


ApsarasX · 2025-07-10T13:25:05Z

@sdmyzlp plz review this PR


github-actions bot added the module:core label Jun 22, 2025

ApsarasX force-pushed the improve-mla-multistream branch 2 times, most recently from e932430 to eecf066 Compare June 22, 2025 07:14

github-actions bot added the merge-conflicts label Jun 25, 2025

ApsarasX force-pushed the improve-mla-multistream branch 2 times, most recently from 056c662 to def32ea Compare July 9, 2025 16:10

github-actions bot removed the merge-conflicts label Jul 9, 2025

ApsarasX force-pushed the improve-mla-multistream branch from def32ea to c3bed8a Compare July 10, 2025 03:09

[Perf] Improve MLA multistream performance

891bd87

Signed-off-by: ApsarasX <apsarax@outlook.com>

ApsarasX force-pushed the improve-mla-multistream branch from c3bed8a to 891bd87 Compare July 10, 2025 11:24

wangxiyuan approved these changes Jul 10, 2025


ApsarasX added the ready read for review label Jul 10, 2025

wangxiyuan merged commit 0fc9b56 into vllm-project:main Jul 11, 2025
27 checks passed

whx-sjtu mentioned this pull request Jul 11, 2025

[0.9.1][Perf] Port MLA multistream optimazition and prefetch to v0.9.1 #1750

Merged

ganyi1996ppo pushed a commit that referenced this pull request Jul 13, 2025

[0.9.1][Perf] Port MLA multistream optimazition and prefetch to v0.9.1 (#1750)

9a5e650

This PR ports the optimization in PR #1353 to v0.9.1-dev. Signed-off-by: whx-sjtu <2952154980@qq.com>
