[Do not merge][Async] Asynchronous DP coordination#30980
MatthewBonanni wants to merge 2 commits into vllm-project:main from
Conversation
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Code Review
This pull request introduces asynchronous data parallelism (DP) coordination to overlap communication with computation, which is a solid approach to improve performance by reducing GPU idle time. The changes are well-structured, introducing start_dp_coordination_async and finish_dp_coordination functions to manage the asynchronous operation. However, I've identified two critical issues. One is a faulty assertion in dp_utils.py that could lead to runtime crashes when DP ranks have conflicting configurations. The other is a bug in gpu_model_runner.py where a padded token count is used for a check that expects an unpadded count, potentially leading to incorrect behavior. Addressing these issues is crucial for the correctness and stability of this feature.
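The start/finish split the review describes could be sketched roughly as below. This is a minimal illustration only: the function names follow the review, but the thread-based stand-in for the collective is an assumption, not vLLM's actual NCCL/CUDA implementation.

```python
from concurrent.futures import ThreadPoolExecutor, Future

# start_dp_coordination_async launches the cross-DP exchange (in the real
# code, a collective over per-rank batch metadata) without blocking, and
# finish_dp_coordination waits for the result only when it is needed,
# letting GPU compute overlap with the communication in between.

_executor = ThreadPoolExecutor(max_workers=1)

def _exchange_across_dp(local_tokens: int) -> int:
    # Placeholder for the real collective (e.g. an all-reduce of token
    # counts across DP ranks); returns the local value so the sketch runs.
    return local_tokens

def start_dp_coordination_async(local_tokens: int) -> Future:
    # Launch the coordination without blocking the caller.
    return _executor.submit(_exchange_across_dp, local_tokens)

def finish_dp_coordination(handle: Future) -> int:
    # Block only at the point the coordinated value is actually required.
    return handle.result()

# Usage: start the exchange, do overlapped work, then collect the result.
handle = start_dp_coordination_async(128)
# ... model execution would proceed here while the exchange is in flight ...
num_tokens = finish_dp_coordination(handle)
```

The key design point is that the sync cost is paid at `finish_dp_coordination`, after useful work has been scheduled, rather than at the point the coordination is initiated.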
Holding off on this for now to evaluate a different approach

Closing, superseded by #32951
Purpose
When using async scheduling with MTP and DP, _get_valid_sampled_token_count causes a sync point. Following this sync point, DP coordination causes a GPU bubble. This PR creates and employs an asynchronous replacement for coordinate_batch_across_dp, allowing this communication to overlap with execution.
Test Plan
with
Test Result
Correctness
Performance
mean TPOT: 268.02 ms -> 260.72 ms = 2.7% speedup
Main:
PR:
Profile
Main: Bubble evident

PR: No bubble

Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.