
[bugfix] Support pipeline parallelism for Deepseek V3.2 DSA-CP #6589

Open
zzhx1 wants to merge 2 commits into vllm-project:main from zzhx1:sfa-cp-pp

Conversation

zzhx1 (Contributor) commented Feb 6, 2026

What this PR does / why we need it?

Improve Deepseek V3.2 throughput by fixing DSA-CP support for pipeline parallelism (PP). Depends on #6282.

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions bot (Contributor) commented Feb 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @zzhx1, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances support for Deepseek V3.2 DSA-CP by improving pipeline parallelism and addressing throughput concerns. The changes focus on robust synchronization for layer sharding and precise management of intermediate tensors across parallel processing units. These updates are crucial for stability and efficiency when running large models in a distributed environment, particularly when sequence parallelism is involved.

Highlights

  • Pipeline Parallelism Synchronization: Introduced a new wait_all_layers function to explicitly synchronize layer sharding operations, ensuring all pending work is completed before proceeding (see the sketch after this list).
  • Intermediate Tensor Management: Added a sync_and_slice_intermediate_tensors method to correctly handle and distribute intermediate tensor data across different tensor parallel ranks, especially when sequence parallelism is enabled.
  • Layer Sharding Initialization: Implemented sorting for layer_sharding_kwargs during the initialization of SFAImpl to ensure a consistent order for layer sharding configuration.
  • Memory Optimization in Dummy Run: Refactored the _dummy_run method to utilize the new intermediate tensor synchronization and slicing logic, which helps in accurate memory estimation and prevents Out-Of-Memory (OOM) errors in pipeline parallelism with sequence parallelism.
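
The barrier in the first highlight can be pictured as draining every in-flight asynchronous weight transfer before execution moves on. Below is a minimal sketch; `_SHARD_SERIES` and `ShardWindow` are illustrative names assumed for this example, not the actual API of `vllm_ascend.ops.layer_shard_linear`:

```python
# Hypothetical sketch of a wait_all_layers() barrier. _SHARD_SERIES and
# ShardWindow are made-up names; the real structures differ in detail.
from collections import deque


class ShardWindow:
    """Tracks in-flight async weight-load handles for one layer-shard series."""

    def __init__(self) -> None:
        # Each entry is expected to expose .wait(), e.g. a torch.distributed
        # work handle returned by an async broadcast.
        self.pending: deque = deque()

    def wait_all(self) -> None:
        # Drain every outstanding handle so all sharded weights are resident
        # before the forward pass hands off to the next pipeline stage.
        while self.pending:
            self.pending.popleft().wait()


# Global registry populated once per layer-shard series at model init.
_SHARD_SERIES: list[ShardWindow] = []


def wait_all_layers() -> None:
    """Barrier: block until no registered shard window has pending work."""
    for window in _SHARD_SERIES:
        window.wait_all()
```

Per the changelog below, the PR invokes this barrier at the end of the attention forward method, after the KV cache operations, so each pipeline stage hands off only once its sharded weights are fully loaded.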


Changelog
  • vllm_ascend/attention/sfa_v1.py
    • Imported wait_all_layers from vllm_ascend.ops.layer_shard_linear and extract_layer_index from vllm.model_executor.models.utils.
    • Added a sorting step for self.layer_sharding_kwargs based on layer prefix during initialization (see the sorting sketch after this changelog).
    • Integrated a call to wait_all_layers() within the forward method to ensure synchronization after KV cache operations.
  • vllm_ascend/ops/layer_shard_linear.py
    • Implemented the wait_all_layers function, which iterates through all registered layer shard series and waits for any outstanding work in their shard windows.
    • Minor adjustment to comments in the reach_layer method, with no functional change to the calculation of next_layer_idx (see the prefetch sketch after this changelog).
  • vllm_ascend/worker/model_runner_v1.py
    • Introduced a new method sync_and_slice_intermediate_tensors to handle the synchronization and slicing of intermediate tensors based on tensor parallel size and sequence parallelism status.
    • Updated the _dummy_run method to leverage the new sync_and_slice_intermediate_tensors for initializing and managing self.intermediate_tensors, improving memory allocation logic for parallel execution.
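
The sorting step matters because string ordering of module prefixes is not numeric: "model.layers.10" sorts before "model.layers.2". A small stand-alone illustration follows; the stand-in below only mimics the behavior of extract_layer_index from vllm.model_executor.models.utils, and the layer_sharding_kwargs entries are made up:

```python
# Illustration of deterministic ordering by numeric layer index. The helper
# below is a stand-in for extract_layer_index; entries are hypothetical.
import re


def extract_layer_index(prefix: str) -> int:
    """Return the first integer path component in a dotted module prefix."""
    match = re.search(r"\.(\d+)(?:\.|$)", prefix)
    assert match is not None, f"no layer index in {prefix!r}"
    return int(match.group(1))


layer_sharding_kwargs = [
    {"prefix": "model.layers.10.self_attn"},
    {"prefix": "model.layers.2.self_attn"},
    {"prefix": "model.layers.1.self_attn"},
]

# Lexicographic order would yield 1, 10, 2; numeric order is identical on
# every rank, so shard registration lines up with execution order.
layer_sharding_kwargs.sort(key=lambda kw: extract_layer_index(kw["prefix"]))
assert [extract_layer_index(kw["prefix"])
        for kw in layer_sharding_kwargs] == [1, 2, 10]
```
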
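
For the prefetch indexing around next_layer_idx, the idea is that when execution reaches layer i, the weights for a layer a fixed distance ahead are fetched asynchronously. A hedged sketch of that indexing follows; the function name, PREFETCH_DISTANCE, and the wrap-around policy are assumptions for illustration, not the module's actual implementation:

```python
# Hypothetical sketch of prefetch indexing in a reach_layer-style method.
# PREFETCH_DISTANCE and the wrap-around policy are illustrative assumptions.
PREFETCH_DISTANCE = 1


def next_layer_to_prefetch(current_layer_idx: int, num_layers: int) -> int:
    """Index of the layer whose weights should be fetched asynchronously.

    Wraps around so the first layers of the next forward pass are warm.
    """
    return (current_layer_idx + PREFETCH_DISTANCE) % num_layers


# Reaching layer 2 of a 4-layer shard series prefetches layer 3;
# reaching the last layer wraps back to layer 0.
assert next_layer_to_prefetch(2, num_layers=4) == 3
assert next_layer_to_prefetch(3, num_layers=4) == 0
```
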
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several important fixes and refactorings to better support pipeline parallelism for Deepseek V3.2 models with DSA-CP. The changes are well-structured and address critical issues for distributed inference.

Key improvements include:

  • A crucial bug fix in the layer sharding prefetching logic to ensure correct weight loading.
  • The addition of necessary synchronization points (wait_all_layers) for pipeline parallelism.
  • A clean refactoring of intermediate tensor handling during dummy runs, which simplifies the code and resolves potential memory estimation errors (see the sketch after this list).
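
That refactoring can be summarized as: on non-first pipeline ranks, receive the previous stage's intermediate tensors, then, when sequence parallelism is enabled, keep only this TP rank's slice of the token dimension. Below is a minimal sketch; the function name matches the PR, but the body, signature, and shapes are illustrative, and a plain dict stands in for vLLM's IntermediateTensors:

```python
# Hypothetical sketch of sync_and_slice_intermediate_tensors; a plain dict
# stands in for vLLM's IntermediateTensors, and shapes/arguments are made up.
import torch


def sync_and_slice_intermediate_tensors(
    tensors: dict[str, torch.Tensor],
    tp_rank: int,
    tp_size: int,
    enable_sequence_parallelism: bool,
) -> dict[str, torch.Tensor]:
    """Keep only this TP rank's token shard when sequence parallelism is on.

    In the real runner the tensors would first be broadcast/received from
    the previous pipeline stage; that synchronization step is omitted here.
    """
    if not enable_sequence_parallelism or tp_size == 1:
        return tensors
    sliced = {}
    for name, t in tensors.items():
        # Token dimension is dim 0 and must divide evenly across TP ranks.
        assert t.shape[0] % tp_size == 0
        chunk = t.shape[0] // tp_size
        sliced[name] = t[tp_rank * chunk:(tp_rank + 1) * chunk]
    return sliced


# Example: 16 tokens, hidden size 8, 4 TP ranks -> each rank keeps 4 tokens.
full = {"hidden_states": torch.randn(16, 8)}
shard = sync_and_slice_intermediate_tensors(
    full, tp_rank=1, tp_size=4, enable_sequence_parallelism=True)
assert shard["hidden_states"].shape == (4, 8)
```

Slicing the dummy-run tensors this way is what keeps memory estimation accurate: each rank profiles against its actual shard size rather than the full token batch, which is how the OOM issue described above is avoided.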

The code quality is good, and the changes are logical and necessary for the feature. I did not find any high or critical issues in this pull request.

Following the repository's style guide, here are the suggested updates for the PR title and summary:

Suggested PR Title:

[Parallelism][BugFix] Support pipeline parallelism for Deepseek V3.2 DSA-CP

Suggested PR Summary:

### What this PR does / why we need it?

This PR enables pipeline parallelism (PP) for Deepseek V3.2 models using DeepSeek Sparse Attention with Context Parallelism (DSA-CP). The changes address several key areas to ensure correct and efficient distributed execution:

- **Fixes Layer Sharding Prefetching:** Corrects a bug in the layer prefetching logic where an incorrect layer index was calculated. This ensures the right weights are loaded asynchronously.
- **Adds Pipeline Synchronization:** Introduces `wait_all_layers()` as a synchronization barrier at the end of the attention forward pass. This is essential for coordinating weight loading between pipeline stages.
- **Refactors Intermediate Tensor Handling:** Simplifies the logic for managing intermediate tensors in `_dummy_run` for non-first pipeline ranks. This improves code clarity and fixes memory estimation issues that could lead to OOM errors.
- **Ensures Deterministic Layer Registration:** Sorts layer sharding configurations to guarantee a deterministic registration order, preventing potential inconsistencies in a distributed setup.

These changes are critical for supporting high-throughput inference of Deepseek V3.2 models in a pipeline parallel environment.

Fixes #6282

### Does this PR introduce _any_ user-facing change?

No. These are backend improvements for model parallelism and do not affect user-facing APIs.

### How was this patch tested?

The changes were validated by running Deepseek V3.2 models with pipeline parallelism enabled. Correctness was verified by comparing outputs with single-GPU execution. Performance and memory usage were monitored to confirm the effectiveness of the memory estimation fixes.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

github-actions bot (Contributor) commented Feb 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
github-actions bot (Contributor)

This pull request has conflicts, please resolve those before we can evaluate the pull request.
