refactor context parallel state #17213
Conversation
Summary of Changes

This pull request refactors the context parallel state to support attention and MoE context parallelism. It introduces new group coordinators and updates initialization functions to manage these parallel processing capabilities. The changes also include modifications to scheduler processes and the addition of new server arguments for configuration.
Code Review
This pull request introduces context parallelism for attention and MoE layers, which is a significant refactoring. The changes are extensive, touching many files to plumb through the new configuration and rank information. The core logic for creating the new parallel groups has been added.
My review focuses on the correctness of the new parallelism group initialization. While the logic for attention context parallelism (_ATTN_CP) and other groups seems correct, I've found a critical issue in the initialization of the MoE context parallel group (_MOE_CP). The current implementation appears to create groups across pipeline stages instead of within a single stage, which is incorrect for context parallelism.
I've provided a detailed comment with a suggested fix for this issue. Please address this to ensure the correctness of MoE context parallelism.
Also, there is a small typo in the pull request title: "refatcor" should be "refactor".
```python
for i in range(num_tensor_model_parallel_groups):
    for j in range(moe_tp_size * moe_ep_size):
        st = i * tensor_model_parallel_size + j
        en = (i + 1) * tensor_model_parallel_size + j
        ranks = list(range(st, en, moe_tp_size * moe_ep_size))
        group_ranks.append(ranks)
```
The logic for creating the _MOE_CP (MoE Context Parallel) group appears to be incorrect. It seems to be creating groups across pipeline parallel stages, similar to how pipeline parallel groups are formed, rather than creating context parallel groups within a single pipeline stage.
A context parallel group for MoE should group ranks that handle different parts of the context but the same expert and tensor slice. The current implementation:
```python
for i in range(num_tensor_model_parallel_groups):
    for j in range(moe_tp_size * moe_ep_size):
        st = i * tensor_model_parallel_size + j
        en = (i + 1) * tensor_model_parallel_size + j
        ranks = list(range(st, en, moe_tp_size * moe_ep_size))
        group_ranks.append(ranks)
```

Here, `i` iterates through pipeline stages, and `en` points to a rank in the next pipeline stage, which is incorrect for a context parallel group.
A correct implementation should iterate within a single pipeline stage. Assuming a rank layout of (cp, ep, tp) within a tensor parallel group, the logic should be something like this:
```python
for i in range(num_tensor_model_parallel_groups):
    for j in range(moe_ep_size):
        for k in range(moe_tp_size):
            # Assuming a rank layout of (cp, ep, tp)
            base = i * tensor_model_parallel_size + j * moe_tp_size + k
            stride = moe_ep_size * moe_tp_size
            ranks = [base + c * stride for c in range(moe_cp_size)]
            group_ranks.append(ranks)
```
Please make sure this PR can pass this unit test.

Both tests already passed. Thanks!
Force-pushed 3d7a871 to 71be3c3.
/tag-and-rerun-ci
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.com>
PR #17213 added attn_cp_rank and moe_dp_rank parameters to run_scheduler_process, but the gRPC scheduler_launcher was not updated, causing a startup failure due to missing arguments.
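The breakage described above is a generic Python pitfall when new required positional parameters are added to a function without updating every call site. A minimal sketch of the failure mode and one mitigation; the function names and signatures here are illustrative stand-ins, not the actual scheduler code (only the parameter names attn_cp_rank and moe_dp_rank come from the comment above):

```python
# Illustrative stand-in for a scheduler entry point that gained new
# required parameters (attn_cp_rank, moe_dp_rank).
def run_scheduler_process(gpu_id, tp_rank, attn_cp_rank, moe_dp_rank):
    return (gpu_id, tp_rank, attn_cp_rank, moe_dp_rank)

# An un-updated launcher still using the old call shape fails at startup:
try:
    run_scheduler_process(0, 0)
except TypeError as exc:
    print("startup failure:", exc)

# Giving the new parameters defaults keeps old call sites working
# until every launcher is migrated:
def run_scheduler_process_compat(gpu_id, tp_rank, attn_cp_rank=0, moe_dp_rank=0):
    return (gpu_id, tp_rank, attn_cp_rank, moe_dp_rank)

print(run_scheduler_process_compat(0, 0))  # (0, 0, 0, 0)
```

Whether defaults are acceptable here depends on whether a rank of 0 is a safe fallback; updating the gRPC launcher to pass the new arguments explicitly is the more robust fix.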
This PR disables PP+CP; will this be supported in the future?
Motivation
Context parallelism is essential for long-context LLM inference. It splits a long input sequence across multiple GPUs so attention can be computed in parallel, drastically reducing latency and enabling practical million-token context windows.
Previously, DeepSeek-V3.2 already supported CP and could use it together with DP. We aim to support the combination of CP, DP, and TP, and make it easier to apply to other models. To achieve this, we first refactored the original implementation.
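The core idea of splitting the sequence can be sketched in a few lines. This is a contiguous split for illustration only; real implementations often use balanced or zig-zag partitioning (e.g. for ring attention), and `shard_sequence` is a hypothetical helper, not part of this PR:

```python
def shard_sequence(tokens, cp_size, cp_rank):
    """Contiguous split of a token sequence across cp_size context parallel ranks."""
    chunk = (len(tokens) + cp_size - 1) // cp_size  # ceil division
    return tokens[cp_rank * chunk : (cp_rank + 1) * chunk]

# A 10-token sequence split across 4 CP ranks:
shards = [shard_sequence(list(range(10)), 4, r) for r in range(4)]
print(shards)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Each CP rank then computes attention over its shard, exchanging keys/values (or partial attention results) with the other ranks in its CP group.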
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci