[0.13.0][Bugfix] Add synced_cudagraph_mode to limit mixed graph modes in dp ranks#6011
Conversation
…es in dp ranks Signed-off-by: Zetong Li <slippersss@126.com>
Code Review
This pull request introduces a mechanism to synchronize the CUDAGraph mode across data-parallel ranks to prevent hangs when different ranks operate in mixed eager/graph modes. The changes are logical and well-implemented, introducing a synced_cudagraph_mode that is used to control graph dispatching. However, I've identified a critical issue where this synchronization is skipped in a specific optimization path for MoE models, which could lead to the very problem this PR aims to solve. My review includes a suggested fix for this issue.
```python
if self._skip_all_reduce_across_dp_group():
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill
return num_tokens, num_tokens_after_padding, with_prefill, cudagraph_mode
```
When _skip_all_reduce_across_dp_group() is true, the all_reduce operation for syncing metadata is skipped. However, this also skips syncing cudagraph_mode, returning the local cudagraph_mode instead. This could lead to different ranks operating in different CUDAGraph modes, which is the exact issue this pull request aims to fix and could cause hangs.
The cudagraph_mode should be synced across all DP ranks regardless of whether other metadata syncing is skipped.
```python
if self._skip_all_reduce_across_dp_group():
    # Even if we skip syncing num_tokens, we must sync cudagraph_mode.
    mode_tensor = torch.tensor([cudagraph_mode], dtype=torch.int32, device="cpu")
    dist.all_reduce(mode_tensor, op=dist.ReduceOp.MIN, group=get_dp_group().cpu_group)
    synced_cudagraph_mode = mode_tensor.item()
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill, synced_cudagraph_mode
```

```python
if self._skip_all_reduce_across_dp_group():
    num_tokens_after_padding = torch.tensor([num_tokens] *
                                            self.dp_size,
                                            device="cpu",
                                            dtype=torch.int32)
    return num_tokens, num_tokens_after_padding, with_prefill
return num_tokens, num_tokens_after_padding, with_prefill, cudagraph_mode
```
@jianzs Hi, as you mentioned in #5979, even when we skip the all_reduce, different dp ranks may still run different graph modes. Since we have hit the issue that A2 + AIV hangs, we have to ensure there is no prefill when entering this _skip_all_reduce_across_dp_group branch.
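As a hedged illustration (not the actual vllm-ascend code), syncing by MIN over the mode enum values makes every rank fall back to the most conservative mode that any rank requested, which is what keeps HCCL collectives from mixing eager and graph participants. The enum below is a hypothetical stand-in mirroring the ordering assumed by the `ReduceOp.MIN` sync (lower value = more conservative):

```python
from enum import IntEnum

# Hypothetical enum mirroring the assumed CUDAGraphMode ordering:
# lower value = more conservative execution mode.
class CUDAGraphMode(IntEnum):
    NONE = 0        # eager execution
    PIECEWISE = 1   # piecewise graph capture
    FULL = 2        # full-graph capture

def sync_cudagraph_mode(local_modes):
    """Simulate an all_reduce(MIN) over the per-rank modes on the CPU group."""
    synced = min(local_modes)
    return [synced] * len(local_modes)

# Rank 0 wants FULL, but rank 1 must run eager (e.g. it has a prefill batch):
# after the sync, both ranks agree on eager execution.
local = [CUDAGraphMode.FULL, CUDAGraphMode.NONE]
synced = sync_cudagraph_mode(local)
assert all(m == CUDAGraphMode.NONE for m in synced)
```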
…lm-ascend into FIA_v0.13.0 * 'releases/v0.13.0' of https://github.com/vllm-project/vllm-ascend: [0.13.0][Bugfix] Add `synced_cudagraph_mode` to limit mixed graph modes in dp ranks (vllm-project#6011)
…es in dp ranks (vllm-project#6011) Signed-off-by: Zetong Li <slippersss@126.com>
What this PR does / why we need it?
This PR fixes a hang when using A2 + AIV, caused by HCCL not supporting communication between eager and graph modes. To handle it, following vllm-project/vllm#30173, we introduce `synced_cudagraph_mode` so that all ranks know the minimum mode across ranks. Main changes are described below:

1. `execute_model` now performs "dispatch -> sync -> re-dispatch", just as `_dummy_run` does.
2. `_sync_metadata_across_dp` now receives `cudagraph_mode` from all ranks and returns `synced_cudagraph_mode` to all ranks.
3. The re-dispatch steps in both `execute_model` and `_dummy_run` pass `disable_full=synced_cudagraph_mode <= CUDAGraphMode.PIECEWISE.value`, so that when it is true, no FULL graph is dispatched.

Does this PR introduce any user-facing change?
N/A
How was this patch tested?
by ci
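The "dispatch -> sync -> re-dispatch" flow described above can be sketched as follows. This is a minimal simulation under stated assumptions, not the vllm-ascend implementation: `CUDAGraphMode`, `dispatch`, and `execute_model_step` are hypothetical stand-ins, and `min()` stands in for the `all_reduce(MIN)` on the DP CPU group.

```python
from enum import IntEnum

# Hypothetical stand-ins for the real types; lower value = more conservative mode.
class CUDAGraphMode(IntEnum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2

def dispatch(num_tokens, with_prefill, disable_full=False):
    """Toy dispatcher: prefill runs eager; disable_full caps graphs at PIECEWISE."""
    if with_prefill:
        return CUDAGraphMode.NONE
    return CUDAGraphMode.PIECEWISE if disable_full else CUDAGraphMode.FULL

def execute_model_step(rank_batches):
    """Sketch of 'dispatch -> sync -> re-dispatch' across DP ranks."""
    # 1. Dispatch: each rank picks a mode from its local batch.
    local = [dispatch(n, p) for n, p in rank_batches]
    # 2. Sync: take the minimum mode across ranks (stand-in for all_reduce MIN).
    synced = min(local)
    # 3. Re-dispatch: disable FULL when the synced mode is at most PIECEWISE.
    disable_full = synced <= CUDAGraphMode.PIECEWISE
    return [dispatch(n, p, disable_full=disable_full) for n, p in rank_batches]

# One decode-only rank and one prefill rank: after the sync, no rank uses FULL,
# so no HCCL collective mixes eager and full-graph participants.
modes = execute_model_step([(16, False), (512, True)])
assert CUDAGraphMode.FULL not in modes
```

Note the design choice this sketch mirrors: the sync happens between two dispatch passes, so each rank first expresses its preferred mode, and the second pass only constrains (never upgrades) that choice.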