
Perf(PP): support PP with async send/recv. #7143

Merged
MengqingCao merged 3 commits into vllm-project:main from pisceskkk:pp/async_comm on Mar 15, 2026

Conversation

@pisceskkk
Contributor

@pisceskkk pisceskkk commented Mar 11, 2026

What this PR does / why we need it?

Following up on PR vllm-project/vllm#33368, this PR adds async send/recv support for PP in vllm-ascend.

How was this patch tested?

Launch server:

vllm serve /mnt/share/DeepSeek-V3.1-Terminus-w8a8-QuaRot-lfs \
    -tp 4 -pp 4 \
    --enable-expert-parallel \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --quantization ascend 

Accuracy

ais_bench --models vllm_api_general_stream --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer example --dump-eval-details --merge-ds --debug
| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.88 |

Performance

We use the first 192 requests of gsm8k with output_len=512 and concurrency=32 to test the performance.

ais_bench --models vllm_api_general_stream --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer stable_stage --mode perf --debug

TP4PP4 w/o async_comm

╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max           │ Median         │ P75           │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ stable  │ 60808.291 ms   │ 36970.3065 ms  │ 95466.1041 ms │ 57094.1573 ms  │ 70298.6203 ms │ 85978.3062 ms  │ 92327.9702 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ stable  │ 31258.8542 ms  │ 21012.2119 ms  │ 41980.8771 ms │ 31020.4824 ms  │ 35207.6049 ms │ 38376.0513 ms  │ 41498.4549 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ stable  │ 111.6517 ms    │ 102.3503 ms    │ 121.5439 ms   │ 111.7984 ms    │ 113.1814 ms   │ 114.6871 ms    │ 118.7533 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ stable  │ 110.9294 ms    │ 0.0114 ms      │ 376.9445 ms   │ 105.3454 ms    │ 122.0904 ms   │ 128.4072 ms    │ 301.9703 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ stable  │ 96.2625        │ 63.0           │ 166.0         │ 90.0           │ 106.0         │ 128.0          │ 163.05         │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ stable  │ 266.3562       │ 79.0           │ 512.0         │ 216.0          │ 349.0         │ 512.0          │ 512.0          │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ stable  │ 4.1415 token/s │ 1.6352 token/s │ 6.326 token/s │ 4.0194 token/s │ 5.265 token/s │ 5.9415 token/s │ 6.2313 token/s │ 160 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ stable  │ 306174.3192 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ stable  │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ stable  │ 31.7771          │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ stable  │ 0.5226 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ stable  │ 15402            │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ stable  │ 3.0795 token/s   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ stable  │ 42617            │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ stable  │ 50.3047 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ stable  │ 139.1919 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ stable  │ 189.4966 token/s │
╘══════════════════════════╧═════════╧══════════════════╛

This PR
TP4PP4 w/ async_comm

╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ stable  │ 57323.2423 ms  │ 33004.4612 ms  │ 94499.1861 ms  │ 52507.0165 ms  │ 68567.6413 ms  │ 83569.326 ms   │ 90748.0637 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ stable  │ 29371.7566 ms  │ 19181.7381 ms  │ 38583.9645 ms  │ 29121.3816 ms  │ 31798.7549 ms  │ 34395.7401 ms  │ 37567.0002 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ stable  │ 110.1783 ms    │ 101.9609 ms    │ 115.6911 ms    │ 110.4653 ms    │ 111.7588 ms    │ 112.5478 ms    │ 114.7689 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ stable  │ 109.6919 ms    │ 0.0121 ms      │ 364.2795 ms    │ 103.5522 ms    │ 119.4574 ms    │ 125.7936 ms    │ 297.5965 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ stable  │ 96.2625        │ 63.0           │ 166.0          │ 90.0           │ 106.0          │ 128.0          │ 163.05         │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ stable  │ 254.7938       │ 80.0           │ 512.0          │ 206.0          │ 333.0          │ 512.0          │ 512.0          │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ stable  │ 4.1723 token/s │ 1.9579 token/s │ 6.4077 token/s │ 3.9791 token/s │ 5.0227 token/s │ 5.9979 token/s │ 6.3588 token/s │ 160 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ stable  │ 284838.9657 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ stable  │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ stable  │ 0.5617 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ stable  │ 15402            │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ stable  │ 3.2774 token/s   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ stable  │ 40767            │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ stable  │ 54.0727 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ stable  │ 143.123 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ stable  │ 197.1956 token/s │
╘══════════════════════════╧═════════╧══════════════════╛

Summary

PP w/ async_comm vs. PP w/o async_comm

  • 4% total throughput improvement
  • 6% TTFT reduction
  • 1.3% TPOT reduction

@pisceskkk pisceskkk requested a review from MengqingCao as a code owner March 11, 2026 02:27
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the vllm-ascend framework by integrating asynchronous communication capabilities for pipeline parallelism. The changes aim to boost overall performance by enabling computation and communication to occur concurrently, thereby reducing latency and increasing throughput during model execution. This is achieved through a new mechanism for handling intermediate tensors and managing communication handles.

Highlights

  • Asynchronous Communication for Pipeline Parallelism: Introduced asynchronous send/receive operations (isend_tensor_dict, irecv_tensor_dict) for Pipeline Parallelism (PP) communication within the NPUWorker to improve performance.
  • New AsyncIntermediateTensors Class: Added AsyncIntermediateTensors to manage lazy communication synchronization, allowing computation to overlap with communication by deferring the waiting for communication handles until the tensors are actually accessed.
  • Performance Improvements: Demonstrated significant performance gains, including improvements in total throughput, and reductions in Time To First Token (TTFT) and Time Per Output Token (TPOT) due to the asynchronous communication.
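The lazy-synchronization idea behind AsyncIntermediateTensors can be sketched as follows. This is a hypothetical illustration based only on the class name and behavior described above; the real class in vllm_ascend/worker/worker.py may differ. Here a "handle" is assumed to be any work object exposing a .wait() method, as returned by non-blocking collectives.

```python
# Hypothetical sketch of lazy communication synchronization: the wait()
# on the irecv handles is deferred until the tensors are first accessed,
# so setup work on the receiving stage can overlap with communication.
class AsyncIntermediateTensors:
    def __init__(self, tensors, handles):
        self._tensors = tensors    # dict of name -> tensor-like value
        self._handles = handles    # work handles, each with a .wait()
        self._synced = False

    def _sync(self):
        # Block on all outstanding receives, exactly once.
        if not self._synced:
            for handle in self._handles:
                handle.wait()
            self._synced = True

    def __getitem__(self, key):
        self._sync()               # first access triggers the wait
        return self._tensors[key]

    def items(self):
        self._sync()
        return self._tensors.items()
```

Code that never touches the tensors never pays for the wait; code that does touch them pays it exactly once, at first access.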


Changelog
  • vllm_ascend/worker/worker.py
    • Added Callable and Handle imports to support asynchronous operations.
    • Introduced the AsyncIntermediateTensors class for lazy synchronization of communication handles and post-processing functions.
    • Modified the NPUWorker to store and manage asynchronous send work handles (_pp_send_work).
    • Updated the execute_model method to use irecv_tensor_dict for non-blocking tensor reception and isend_tensor_dict for non-blocking tensor transmission, integrating with the new AsyncIntermediateTensors.
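The send-side bookkeeping described in the last two bullets can be sketched like this. All names here are hypothetical stand-ins: isend_tensor_dict is assumed to launch a non-blocking send and return a list of work handles with .wait(), and _pp_send_work mirrors the handle storage mentioned above.

```python
# Hypothetical sketch of the worker's async-send bookkeeping: keep the
# handles of the in-flight isend and wait on them only just before the
# next send, so the send overlaps with the next step's computation.
class PPSendState:
    def __init__(self, isend_tensor_dict):
        self._isend_tensor_dict = isend_tensor_dict  # injected comm fn
        self._pp_send_work = None                    # in-flight handles

    def send(self, tensor_dict):
        # Complete the previous async send before reusing its buffers.
        if self._pp_send_work is not None:
            for work in self._pp_send_work:
                work.wait()
        # Kick off the next non-blocking send; do not wait here.
        self._pp_send_work = self._isend_tensor_dict(tensor_dict)
```

The key design point is that the wait for send N happens at the start of send N+1 rather than at the end of send N, hiding the transfer behind the intervening model execution.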

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous send/receive operations for pipeline parallelism to improve performance. The implementation uses a new AsyncIntermediateTensors class for lazy synchronization of received tensors. The overall approach is sound. However, the use of __getattribute__ for lazy synchronization in AsyncIntermediateTensors can cause issues with torch.compile, potentially leading to graph breaks and performance degradation in compiled mode. I've suggested a more explicit, compiler-friendly approach.
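The explicit, compiler-friendly alternative alluded to here could look like the following. This is a sketch of the general technique, not the reviewer's actual suggestion; the class and method names are hypothetical. Instead of intercepting attribute access dynamically, the caller performs one explicit synchronization before the tensors reach compiled code, so there is no dynamic dispatch for torch.compile to trace through.

```python
# Sketch of explicit (rather than lazy, access-triggered) handle
# synchronization: callers invoke wait_all() once at a known point.
class EagerSyncTensors:
    def __init__(self, tensors, handles):
        self.tensors = tensors          # dict of name -> tensor-like
        self.handles = list(handles)    # work handles with .wait()

    def wait_all(self):
        # Single explicit synchronization point; safe to call twice.
        for handle in self.handles:
            handle.wait()
        self.handles.clear()
        return self.tensors
```

The trade-off is that the overlap window is fixed by where the caller places wait_all(), rather than being discovered automatically at first access.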

Comment thread vllm_ascend/worker/worker.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4be6a996f


Comment thread vllm_ascend/worker/worker.py
@pisceskkk pisceskkk force-pushed the pp/async_comm branch 3 times, most recently from b38fb53 to 530e817 Compare March 11, 2026 08:10
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Mar 13, 2026
@MengqingCao
Copy link
Copy Markdown
Collaborator

Please rebase your code after #7230 is merged.

@MengqingCao MengqingCao merged commit 7daccf4 into vllm-project:main Mar 15, 2026
53 of 55 checks passed
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
### What this PR does / why we need it?
Follow up the PR vllm-project/vllm#33368, this
PR provides async send/recv support for PP in vllm-ascend.

---
- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@4034c3d

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>

Labels

ready (read for review) · ready-for-test (start test by label for PR)

5 participants