
[bugfix] Support pipeline parallelism for Deepseek V3.2 DSA-CP #6589

Open
zzhx1 wants to merge 2 commits into vllm-project:main from zzhx1:sfa-cp-pp

Conversation

zzhx1 (Contributor) commented Feb 6, 2026

What this PR does / why we need it?

Improve Deepseek V3.2 throughput by fixing DSA-CP support for pipeline parallelism (PP). Depends on #6282.

Does this PR introduce any user-facing change?

How was this patch tested?

github-actions bot (Contributor) commented Feb 6, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot (Contributor)

Summary of Changes

Hello @zzhx1, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances support for Deepseek V3.2 DSA-CP by improving pipeline parallelism and addressing throughput concerns. The changes focus on robust synchronization for layer sharding and precise management of intermediate tensors across parallel processing units. These updates are crucial for stability and efficiency when running large models in a distributed environment, particularly when sequence parallelism is involved.

Highlights

  • Pipeline Parallelism Synchronization: Introduced a new wait_all_layers function to explicitly synchronize layer sharding operations, ensuring all pending work is completed before proceeding (see the sketch after this list).
  • Intermediate Tensor Management: Added a sync_and_slice_intermediate_tensors method to correctly handle and distribute intermediate tensor data across different tensor parallel ranks, especially when sequence parallelism is enabled.
  • Layer Sharding Initialization: Implemented sorting for layer_sharding_kwargs during the initialization of SFAImpl to ensure a consistent order for layer sharding configuration.
  • Memory Optimization in Dummy Run: Refactored the _dummy_run method to utilize the new intermediate tensor synchronization and slicing logic, which helps in accurate memory estimation and prevents Out-Of-Memory (OOM) errors in pipeline parallelism with sequence parallelism.
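
The barrier in the first highlight can be pictured as draining every in-flight asynchronous weight transfer before execution moves on. Below is a minimal sketch; `_SHARD_SERIES` and `ShardWindow` are illustrative names assumed for this example, not the actual API of `vllm_ascend.ops.layer_shard_linear`:

```python
# Hypothetical sketch of a wait_all_layers() barrier. _SHARD_SERIES and
# ShardWindow are made-up names; the real structures differ in detail.
from collections import deque


class ShardWindow:
    """Tracks in-flight async weight-load handles for one layer-shard series."""

    def __init__(self) -> None:
        # Each entry is expected to expose .wait(), e.g. a torch.distributed
        # work handle returned by an async broadcast.
        self.pending: deque = deque()

    def wait_all(self) -> None:
        # Drain every outstanding handle so all sharded weights are resident
        # before the forward pass hands off to the next pipeline stage.
        while self.pending:
            self.pending.popleft().wait()


# Global registry populated once per layer-shard series at model init.
_SHARD_SERIES: list[ShardWindow] = []


def wait_all_layers() -> None:
    """Barrier: block until no registered shard window has pending work."""
    for window in _SHARD_SERIES:
        window.wait_all()
```

Per the changelog below, the PR invokes this barrier at the end of the attention forward method, after the KV cache operations, so each pipeline stage hands off only once its sharded weights are fully loaded.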


Changelog
  • vllm_ascend/attention/sfa_v1.py
    • Imported wait_all_layers from vllm_ascend.ops.layer_shard_linear and extract_layer_index from vllm.model_executor.models.utils.
    • Added a sorting step for self.layer_sharding_kwargs based on layer prefix during initialization (see the sorting sketch after this changelog).
    • Integrated a call to wait_all_layers() within the forward method to ensure synchronization after KV cache operations.
  • vllm_ascend/ops/layer_shard_linear.py
    • Implemented the wait_all_layers function, which iterates through all registered layer shard series and waits for any outstanding work in their shard windows.
    • Minor adjustment to comments in the reach_layer method, with no functional change to the calculation of next_layer_idx (see the prefetch sketch after this changelog).
  • vllm_ascend/worker/model_runner_v1.py
    • Introduced a new method sync_and_slice_intermediate_tensors to handle the synchronization and slicing of intermediate tensors based on tensor parallel size and sequence parallelism status.
    • Updated the _dummy_run method to leverage the new sync_and_slice_intermediate_tensors for initializing and managing self.intermediate_tensors, improving memory allocation logic for parallel execution.
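
The sorting step matters because string ordering of module prefixes is not numeric: "model.layers.10" sorts before "model.layers.2". A small stand-alone illustration follows; the stand-in below only mimics the behavior of extract_layer_index from vllm.model_executor.models.utils, and the layer_sharding_kwargs entries are made up:

```python
# Illustration of deterministic ordering by numeric layer index. The helper
# below is a stand-in for extract_layer_index; entries are hypothetical.
import re


def extract_layer_index(prefix: str) -> int:
    """Return the first integer path component in a dotted module prefix."""
    match = re.search(r"\.(\d+)(?:\.|$)", prefix)
    assert match is not None, f"no layer index in {prefix!r}"
    return int(match.group(1))


layer_sharding_kwargs = [
    {"prefix": "model.layers.10.self_attn"},
    {"prefix": "model.layers.2.self_attn"},
    {"prefix": "model.layers.1.self_attn"},
]

# Lexicographic order would yield 1, 10, 2; numeric order is identical on
# every rank, so shard registration lines up with execution order.
layer_sharding_kwargs.sort(key=lambda kw: extract_layer_index(kw["prefix"]))
assert [extract_layer_index(kw["prefix"])
        for kw in layer_sharding_kwargs] == [1, 2, 10]
```
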
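
For the prefetch indexing around next_layer_idx, the idea is that when execution reaches layer i, the weights for a layer a fixed distance ahead are fetched asynchronously. A hedged sketch of that indexing follows; the function name, PREFETCH_DISTANCE, and the wrap-around policy are assumptions for illustration, not the module's actual implementation:

```python
# Hypothetical sketch of prefetch indexing in a reach_layer-style method.
# PREFETCH_DISTANCE and the wrap-around policy are illustrative assumptions.
PREFETCH_DISTANCE = 1


def next_layer_to_prefetch(current_layer_idx: int, num_layers: int) -> int:
    """Index of the layer whose weights should be fetched asynchronously.

    Wraps around so the first layers of the next forward pass are warm.
    """
    return (current_layer_idx + PREFETCH_DISTANCE) % num_layers


# Reaching layer 2 of a 4-layer shard series prefetches layer 3;
# reaching the last layer wraps back to layer 0.
assert next_layer_to_prefetch(2, num_layers=4) == 3
assert next_layer_to_prefetch(3, num_layers=4) == 0
```
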
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several important fixes and refactorings to better support pipeline parallelism for Deepseek V3.2 models with DSA-CP. The changes are well-structured and address critical issues for distributed inference.

Key improvements include:

  • A crucial bug fix in the layer sharding prefetching logic to ensure correct weight loading.
  • The addition of necessary synchronization points (wait_all_layers) for pipeline parallelism.
  • A clean refactoring of intermediate tensor handling during dummy runs, which simplifies the code and resolves potential memory estimation errors (see the sketch after this list).
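
That refactoring can be summarized as: on non-first pipeline ranks, receive the previous stage's intermediate tensors, then, when sequence parallelism is enabled, keep only this TP rank's slice of the token dimension. Below is a minimal sketch; the function name matches the PR, but the body, signature, and shapes are illustrative, and a plain dict stands in for vLLM's IntermediateTensors:

```python
# Hypothetical sketch of sync_and_slice_intermediate_tensors; a plain dict
# stands in for vLLM's IntermediateTensors, and shapes/arguments are made up.
import torch


def sync_and_slice_intermediate_tensors(
    tensors: dict[str, torch.Tensor],
    tp_rank: int,
    tp_size: int,
    enable_sequence_parallelism: bool,
) -> dict[str, torch.Tensor]:
    """Keep only this TP rank's token shard when sequence parallelism is on.

    In the real runner the tensors would first be broadcast/received from
    the previous pipeline stage; that synchronization step is omitted here.
    """
    if not enable_sequence_parallelism or tp_size == 1:
        return tensors
    sliced = {}
    for name, t in tensors.items():
        # Token dimension is dim 0 and must divide evenly across TP ranks.
        assert t.shape[0] % tp_size == 0
        chunk = t.shape[0] // tp_size
        sliced[name] = t[tp_rank * chunk:(tp_rank + 1) * chunk]
    return sliced


# Example: 16 tokens, hidden size 8, 4 TP ranks -> each rank keeps 4 tokens.
full = {"hidden_states": torch.randn(16, 8)}
shard = sync_and_slice_intermediate_tensors(
    full, tp_rank=1, tp_size=4, enable_sequence_parallelism=True)
assert shard["hidden_states"].shape == (4, 8)
```

Slicing the dummy-run tensors this way is what keeps memory estimation accurate: each rank profiles against its actual shard size rather than the full token batch, which is how the OOM issue described above is avoided.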

The code quality is good, and the changes are logical and necessary for the feature. I did not find any high or critical issues in this pull request.

Following the repository's style guide, here are the suggested updates for the PR title and summary:

Suggested PR Title:

[Parallelism][BugFix] Support pipeline parallelism for Deepseek V3.2 DSA-CP

Suggested PR Summary:

### What this PR does / why we need it?

This PR enables pipeline parallelism (PP) for Deepseek V3.2 models using DeepSeek Sparse Attention with Context Parallelism (DSA-CP). The changes address several key areas to ensure correct and efficient distributed execution:

- **Fixes Layer Sharding Prefetching:** Corrects a bug in the layer prefetching logic where an incorrect layer index was calculated. This ensures the right weights are loaded asynchronously.
- **Adds Pipeline Synchronization:** Introduces `wait_all_layers()` as a synchronization barrier at the end of the attention forward pass. This is essential for coordinating weight loading between pipeline stages.
- **Refactors Intermediate Tensor Handling:** Simplifies the logic for managing intermediate tensors in `_dummy_run` for non-first pipeline ranks. This improves code clarity and fixes memory estimation issues that could lead to OOM errors.
- **Ensures Deterministic Layer Registration:** Sorts layer sharding configurations to guarantee a deterministic registration order, preventing potential inconsistencies in a distributed setup.

These changes are critical for supporting high-throughput inference of Deepseek V3.2 models in a pipeline parallel environment.

Fixes #6282

### Does this PR introduce _any_ user-facing change?

No. These are backend improvements for model parallelism and do not affect user-facing APIs.

### How was this patch tested?

The changes were validated by running Deepseek V3.2 models with pipeline parallelism enabled. Correctness was verified by comparing outputs with single-GPU execution. Performance and memory usage were monitored to confirm the effectiveness of the memory estimation fixes.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a

github-actions bot (Contributor) commented Feb 6, 2026

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
github-actions bot (Contributor)

This pull request has conflicts, please resolve those before we can evaluate the pull request.
