[bugfix] Support pipeline parallelism for Deepseek V3.2 DSA-CP #6589
zzhx1 wants to merge 2 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @zzhx1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request enhances support for Deepseek V3.2 DSA-CP by improving pipeline parallelism and addressing throughput concerns. The changes focus on robust synchronization for layer sharding and precise management of intermediate tensors across parallel processing units. These updates are crucial for maintaining stability and efficiency when running large models in a distributed environment, particularly in scenarios involving sequence parallelism.
Code Review
This pull request introduces several important fixes and refactorings to better support pipeline parallelism for Deepseek V3.2 models with DSA-CP. The changes are well-structured and address critical issues for distributed inference.
Key improvements include:
- A crucial bug fix in the layer sharding prefetching logic to ensure correct weight loading.
- The addition of necessary synchronization points (`wait_all_layers`) for pipeline parallelism.
- A clean refactoring of intermediate tensor handling during dummy runs, which simplifies the code and resolves potential memory estimation errors.
The code quality is good, and the changes are logical and necessary for the feature. I did not find any high or critical issues in this pull request.
Following the repository's style guide, here are the suggested updates for the PR title and summary:
Suggested PR Title:
[Parallelism][BugFix] Support pipeline parallelism for Deepseek V3.2 DSA-CP

Suggested PR Summary:
### What this PR does / why we need it?
This PR enables pipeline parallelism (PP) for Deepseek V3.2 models using DeepSeek Sparse Attention with Context Parallelism (DSA-CP). The changes address several key areas to ensure correct and efficient distributed execution:
- **Fixes Layer Sharding Prefetching:** Corrects a bug in the layer prefetching logic where an incorrect layer index was calculated. This ensures the right weights are loaded asynchronously.
- **Adds Pipeline Synchronization:** Introduces `wait_all_layers()` as a synchronization barrier at the end of the attention forward pass. This is essential for coordinating weight loading between pipeline stages.
- **Refactors Intermediate Tensor Handling:** Simplifies the logic for managing intermediate tensors in `_dummy_run` for non-first pipeline ranks. This improves code clarity and fixes memory estimation issues that could lead to OOM errors.
- **Ensures Deterministic Layer Registration:** Sorts layer sharding configurations to guarantee a deterministic registration order, preventing potential inconsistencies in a distributed setup.
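The prefetch-and-wait pattern described in the bullets above can be sketched as follows. This is a minimal illustration, not the actual vllm-ascend implementation: `LayerShardPrefetcher`, `prefetch`, and the index arithmetic are assumptions made for the example; only the `wait_all_layers()` name comes from the PR discussion.

```python
from concurrent.futures import ThreadPoolExecutor


class LayerShardPrefetcher:
    """Illustrative sketch of asynchronous layer-weight prefetching
    with an end-of-forward synchronization barrier."""

    def __init__(self, pp_rank: int, layers_per_rank: int):
        self.pp_rank = pp_rank
        self.layers_per_rank = layers_per_rank
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._futures = []

    def global_layer_index(self, local_idx: int) -> int:
        # Analogue of the index bug fix: the prefetch target must be
        # offset by this pipeline stage's first layer, not taken as
        # the stage-local index.
        return self.pp_rank * self.layers_per_rank + local_idx

    def prefetch(self, local_idx: int, load_fn):
        # Kick off an asynchronous weight load for the given layer.
        idx = self.global_layer_index(local_idx)
        self._futures.append(self._pool.submit(load_fn, idx))

    def wait_all_layers(self):
        # Barrier at the end of the attention forward pass: block
        # until every outstanding weight load has finished.
        for f in self._futures:
            f.result()
        self._futures.clear()
```

For example, rank 1 of a setup with 4 layers per stage would prefetch global layers 4 through 7, and `wait_all_layers()` guarantees all of them are resident before the next stage proceeds.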
These changes are critical for supporting high-throughput inference of Deepseek V3.2 models in a pipeline parallel environment.
Fixes #6282
### Does this PR introduce _any_ user-facing change?
No. These are backend improvements for model parallelism and do not affect user-facing APIs.
### How was this patch tested?
The changes were validated by running Deepseek V3.2 models with pipeline parallelism enabled. Correctness was verified by comparing outputs with single-GPU execution. Performance and memory usage were monitored to confirm the effectiveness of the memory estimation fixes.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/d7e17aaacd5ed1b4b4be6bcfef3a1b7cbc84fc9a
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: zhuohuan <zxdu1997@gmail.com>
What this PR does / why we need it?
Support higher throughput for Deepseek V3.2 and fix DSA-CP support for pipeline parallelism (PP). Depends on #6282.
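As a rough illustration of the intermediate-tensor refactor described in the review above, non-first pipeline ranks can be given placeholder buffers during a dummy (memory-profiling) run, while the first rank consumes real token inputs. This is a hedged sketch: the function and key names are hypothetical, and the real code would allocate torch tensors on the device rather than plain-Python lists.

```python
def make_dummy_intermediate_tensors(batch_size, hidden_size, is_first_rank):
    """Sketch only: build placeholder intermediate tensors for a dummy
    run. Only non-first pipeline-parallel ranks need them; the first
    rank consumes real token inputs instead."""
    if is_first_rank:
        return None
    # Plain-Python zero buffers stand in for device tensors here.
    zeros = [[0.0] * hidden_size for _ in range(batch_size)]
    return {"hidden_states": zeros,
            "residual": [row[:] for row in zeros]}
```

Centralizing this decision in one helper keeps `_dummy_run` free of per-rank branching and makes the memory estimate reflect exactly the buffers a non-first rank will actually hold.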
Does this PR introduce any user-facing change?
How was this patch tested?