
Perf(PP): support PP with async send/recv. #7143

Merged
MengqingCao merged 3 commits into vllm-project:main from pisceskkk:pp/async_comm on Mar 15, 2026

Conversation

@pisceskkk
Contributor

@pisceskkk pisceskkk commented Mar 11, 2026

What this PR does / why we need it?

Following up on PR vllm-project/vllm#33368, this PR adds async send/recv support for PP in vllm-ascend.

How was this patch tested?

Launch server:

vllm serve /mnt/share/DeepSeek-V3.1-Terminus-w8a8-QuaRot-lfs \
    -tp 4 -pp 4 \
    --enable-expert-parallel \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --quantization ascend 

Accuracy

ais_bench --models vllm_api_general_stream --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer example --dump-eval-details --merge-ds --debug
| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.88 |

Performance

We use the first 192 requests of gsm8k with output_len=512 and concurrency=32 to test the performance.

ais_bench --models vllm_api_general_stream --datasets gsm8k_gen_0_shot_cot_chat_prompt --summarizer stable_stage --mode perf --debug

TP4PP4 w/o async_comm

╒══════════════════════════╤═════════╤════════════════╤════════════════╤═══════════════╤════════════════╤═══════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max           │ Median         │ P75           │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪═══════════════╪════════════════╪═══════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ stable  │ 60808.291 ms   │ 36970.3065 ms  │ 95466.1041 ms │ 57094.1573 ms  │ 70298.6203 ms │ 85978.3062 ms  │ 92327.9702 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ stable  │ 31258.8542 ms  │ 21012.2119 ms  │ 41980.8771 ms │ 31020.4824 ms  │ 35207.6049 ms │ 38376.0513 ms  │ 41498.4549 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ stable  │ 111.6517 ms    │ 102.3503 ms    │ 121.5439 ms   │ 111.7984 ms    │ 113.1814 ms   │ 114.6871 ms    │ 118.7533 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ stable  │ 110.9294 ms    │ 0.0114 ms      │ 376.9445 ms   │ 105.3454 ms    │ 122.0904 ms   │ 128.4072 ms    │ 301.9703 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ stable  │ 96.2625        │ 63.0           │ 166.0         │ 90.0           │ 106.0         │ 128.0          │ 163.05         │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ stable  │ 266.3562       │ 79.0           │ 512.0         │ 216.0          │ 349.0         │ 512.0          │ 512.0          │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼───────────────┼────────────────┼───────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ stable  │ 4.1415 token/s │ 1.6352 token/s │ 6.326 token/s │ 4.0194 token/s │ 5.265 token/s │ 5.9415 token/s │ 6.2313 token/s │ 160 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧═══════════════╧════════════════╧═══════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ stable  │ 306174.3192 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ stable  │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ stable  │ 31.7771          │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ stable  │ 0.5226 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ stable  │ 15402            │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ stable  │ 3.0795 token/s   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ stable  │ 42617            │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ stable  │ 50.3047 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ stable  │ 139.1919 token/s │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ stable  │ 189.4966 token/s │
╘══════════════════════════╧═════════╧══════════════════╛

This PR
TP4PP4 w/ async_comm

╒══════════════════════════╤═════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤════════════════╤═════╕
│ Performance Parameters   │ Stage   │ Average        │ Min            │ Max            │ Median         │ P75            │ P90            │ P99            │  N  │
╞══════════════════════════╪═════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪════════════════╪═════╡
│ E2EL                     │ stable  │ 57323.2423 ms  │ 33004.4612 ms  │ 94499.1861 ms  │ 52507.0165 ms  │ 68567.6413 ms  │ 83569.326 ms   │ 90748.0637 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TTFT                     │ stable  │ 29371.7566 ms  │ 19181.7381 ms  │ 38583.9645 ms  │ 29121.3816 ms  │ 31798.7549 ms  │ 34395.7401 ms  │ 37567.0002 ms  │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ TPOT                     │ stable  │ 110.1783 ms    │ 101.9609 ms    │ 115.6911 ms    │ 110.4653 ms    │ 111.7588 ms    │ 112.5478 ms    │ 114.7689 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ ITL                      │ stable  │ 109.6919 ms    │ 0.0121 ms      │ 364.2795 ms    │ 103.5522 ms    │ 119.4574 ms    │ 125.7936 ms    │ 297.5965 ms    │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ InputTokens              │ stable  │ 96.2625        │ 63.0           │ 166.0          │ 90.0           │ 106.0          │ 128.0          │ 163.05         │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokens             │ stable  │ 254.7938       │ 80.0           │ 512.0          │ 206.0          │ 333.0          │ 512.0          │ 512.0          │ 160 │
├──────────────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼─────┤
│ OutputTokenThroughput    │ stable  │ 4.1723 token/s │ 1.9579 token/s │ 6.4077 token/s │ 3.9791 token/s │ 5.0227 token/s │ 5.9979 token/s │ 6.3588 token/s │ 160 │
╘══════════════════════════╧═════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧════════════════╧═════╛
╒══════════════════════════╤═════════╤══════════════════╕
│ Common Metric            │ Stage   │ Value            │
╞══════════════════════════╪═════════╪══════════════════╡
│ Benchmark Duration       │ stable  │ 284838.9657 ms   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Requests           │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Failed Requests          │ stable  │ 0                │
├──────────────────────────┼─────────┼──────────────────┤
│ Success Requests         │ stable  │ 160              │
├──────────────────────────┼─────────┼──────────────────┤
│ Concurrency              │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Max Concurrency          │ stable  │ 32               │
├──────────────────────────┼─────────┼──────────────────┤
│ Request Throughput       │ stable  │ 0.5617 req/s     │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Input Tokens       │ stable  │ 15402            │
├──────────────────────────┼─────────┼──────────────────┤
│ Prefill Token Throughput │ stable  │ 3.2774 token/s   │
├──────────────────────────┼─────────┼──────────────────┤
│ Total generated tokens   │ stable  │ 40767            │
├──────────────────────────┼─────────┼──────────────────┤
│ Input Token Throughput   │ stable  │ 54.0727 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Output Token Throughput  │ stable  │ 143.123 token/s  │
├──────────────────────────┼─────────┼──────────────────┤
│ Total Token Throughput   │ stable  │ 197.1956 token/s │
╘══════════════════════════╧═════════╧══════════════════╛

Summary

PP w/ async_comm vs. PP w/o async_comm

  • 4% total throughput improvement
  • 6% TTFT reduction
  • 1.3% TPOT reduction

@pisceskkk pisceskkk requested a review from MengqingCao as a code owner March 11, 2026 02:27
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the vllm-ascend framework by integrating asynchronous communication capabilities for pipeline parallelism. The changes aim to boost overall performance by enabling computation and communication to occur concurrently, thereby reducing latency and increasing throughput during model execution. This is achieved through a new mechanism for handling intermediate tensors and managing communication handles.

Highlights

  • Asynchronous Communication for Pipeline Parallelism: Introduced asynchronous send/receive operations (isend_tensor_dict, irecv_tensor_dict) for Pipeline Parallelism (PP) communication within the NPUWorker to improve performance.
  • New AsyncIntermediateTensors Class: Added AsyncIntermediateTensors to manage lazy communication synchronization, allowing computation to overlap with communication by deferring the waiting for communication handles until the tensors are actually accessed.
  • Performance Improvements: Demonstrated significant performance gains, including improvements in total throughput, and reductions in Time To First Token (TTFT) and Time Per Output Token (TPOT) due to the asynchronous communication.
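The lazy-synchronization idea behind AsyncIntermediateTensors can be sketched as follows. This is a hypothetical illustration based only on the class name and behavior described above; the real class in vllm_ascend/worker/worker.py may differ. Here a "handle" is assumed to be any work object exposing a .wait() method, as returned by non-blocking collectives.

```python
# Hypothetical sketch of lazy communication synchronization: the wait()
# on the irecv handles is deferred until the tensors are first accessed,
# so setup work on the receiving stage can overlap with communication.
class AsyncIntermediateTensors:
    def __init__(self, tensors, handles):
        self._tensors = tensors    # dict of name -> tensor-like value
        self._handles = handles    # work handles, each with a .wait()
        self._synced = False

    def _sync(self):
        # Block on all outstanding receives, exactly once.
        if not self._synced:
            for handle in self._handles:
                handle.wait()
            self._synced = True

    def __getitem__(self, key):
        self._sync()               # first access triggers the wait
        return self._tensors[key]

    def items(self):
        self._sync()
        return self._tensors.items()
```

Code that never touches the tensors never pays for the wait; code that does touch them pays it exactly once, at first access.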


Changelog
  • vllm_ascend/worker/worker.py
    • Added Callable and Handle imports to support asynchronous operations.
    • Introduced the AsyncIntermediateTensors class for lazy synchronization of communication handles and post-processing functions.
    • Modified the NPUWorker to store and manage asynchronous send work handles (_pp_send_work).
    • Updated the execute_model method to use irecv_tensor_dict for non-blocking tensor reception and isend_tensor_dict for non-blocking tensor transmission, integrating with the new AsyncIntermediateTensors.
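The send-side bookkeeping described in the last two bullets can be sketched like this. All names here are hypothetical stand-ins: isend_tensor_dict is assumed to launch a non-blocking send and return a list of work handles with .wait(), and _pp_send_work mirrors the handle storage mentioned above.

```python
# Hypothetical sketch of the worker's async-send bookkeeping: keep the
# handles of the in-flight isend and wait on them only just before the
# next send, so the send overlaps with the next step's computation.
class PPSendState:
    def __init__(self, isend_tensor_dict):
        self._isend_tensor_dict = isend_tensor_dict  # injected comm fn
        self._pp_send_work = None                    # in-flight handles

    def send(self, tensor_dict):
        # Complete the previous async send before reusing its buffers.
        if self._pp_send_work is not None:
            for work in self._pp_send_work:
                work.wait()
        # Kick off the next non-blocking send; do not wait here.
        self._pp_send_work = self._isend_tensor_dict(tensor_dict)
```

The key design point is that the wait for send N happens at the start of send N+1 rather than at the end of send N, hiding the transfer behind the intervening model execution.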

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces asynchronous send/receive operations for pipeline parallelism to improve performance. The implementation uses a new AsyncIntermediateTensors class for lazy synchronization of received tensors. The overall approach is sound. However, the use of __getattribute__ for lazy synchronization in AsyncIntermediateTensors can cause issues with torch.compile, potentially leading to graph breaks and performance degradation in compiled mode. I've suggested a more explicit, compiler-friendly approach.
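The explicit, compiler-friendly alternative alluded to here could look like the following. This is a sketch of the general technique, not the reviewer's actual suggestion; the class and method names are hypothetical. Instead of intercepting attribute access dynamically, the caller performs one explicit synchronization before the tensors reach compiled code, so there is no dynamic dispatch for torch.compile to trace through.

```python
# Sketch of explicit (rather than lazy, access-triggered) handle
# synchronization: callers invoke wait_all() once at a known point.
class EagerSyncTensors:
    def __init__(self, tensors, handles):
        self.tensors = tensors          # dict of name -> tensor-like
        self.handles = list(handles)    # work handles with .wait()

    def wait_all(self):
        # Single explicit synchronization point; safe to call twice.
        for handle in self.handles:
            handle.wait()
        self.handles.clear()
        return self.tensors
```

The trade-off is that the overlap window is fixed by where the caller places wait_all(), rather than being discovered automatically at first access.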

Comment thread vllm_ascend/worker/worker.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4be6a996f


Comment thread vllm_ascend/worker/worker.py
@pisceskkk pisceskkk force-pushed the pp/async_comm branch 3 times, most recently from b38fb53 to 530e817 Compare March 11, 2026 08:10
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
@MengqingCao MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Mar 13, 2026
@MengqingCao
Copy link
Copy Markdown
Collaborator

Please rebase your code after #7230 is merged.

@MengqingCao MengqingCao merged commit 7daccf4 into vllm-project:main Mar 15, 2026
53 of 55 checks passed
Nagisa125 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Mar 17, 2026
### What this PR does / why we need it?
Follow up the PR vllm-project/vllm#33368, this
PR provides async send/recv support for PP in vllm-ascend.

---
- vLLM version: v0.17.0
- vLLM main:
vllm-project/vllm@4034c3d

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>

Labels

ready (read for review) · ready-for-test (start test by label for PR)

5 participants