
Commit 25be145

bangshengtang authored and facebook-github-bot committed
fix a corner case that could cause out-of-sync with async scheduling and dp >1 (#28774)
Summary:
Pull Request resolved: #28774

We observed a comms issue when both async scheduling and DP > 1 are enabled. When async scheduling is enabled, num_scheduled_tokens can become zero, while DP > 1 requires coordinate_batch_across_dp to be called into, where an all_reduce across DP ranks is required. Returning early in that case causes the DP ranks to go out of sync.

Concretely, we hit:

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
I1114 16:43:23 native_sampler.py:491 1199692:MainThread] rank=2: has_unfinished_requests
W1114 16:43:23.344583 1233050 ExceptionTracer.cpp:193] Invalid trace stack for exception of type: gloo::EnforceNotMet
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
```

This happens when dp_rank 0 is calling into has_unfinished_requests() while dp_rank 1 is waiting at coordinate_batch_across_dp(), and the logs indicate that in the previous step dp_rank 0 returned early with num_scheduled_tokens=0.

Test Plan: with the change, our workload runs for multiple hours without triggering this issue.

**before**

```
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
```

The server crashes immediately after the first num_scheduled_tokens=0 case.

**after**

```
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
```

Differential Revision: D87131186
1 parent 681bd93 commit 25be145
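
To make the lockstep requirement from the summary concrete, here is a minimal standalone sketch, not vLLM code: the function name coordinate_step and the token counts are made up for illustration. Every DP rank has to join the per-step all_reduce even when it has nothing scheduled; a rank that returns early instead leaves its peers blocked at the collective, or mismatched in message size as in the Gloo failure above.

```python
import torch
import torch.distributed as dist


def coordinate_step(num_scheduled_tokens: int) -> int:
    # Every DP rank must enter this collective once per engine step, even
    # with zero scheduled tokens. If one rank skips it, the other ranks
    # block here or read a payload of the wrong size (the Gloo
    # "op.preamble.length <= op.nbytes" failure seen in the summary).
    t = torch.tensor([num_scheduled_tokens], dtype=torch.int64)
    dist.all_reduce(t)  # sums scheduled tokens across DP ranks
    return int(t.item())


if __name__ == "__main__":
    # Run with, e.g.: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    # Rank 0 has nothing scheduled this step but still joins the collective.
    total = coordinate_step(0 if rank == 0 else 4)
    print(f"rank={rank} total_scheduled_across_dp={total}")
    dist.destroy_process_group()
```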

1 file changed: +11 additions, 0 deletions


vllm/v1/worker/gpu_model_runner.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -2534,6 +2534,17 @@ def execute_model(
             return make_empty_encoder_model_runner_output(scheduler_output)
 
         if not num_scheduled_tokens:
+            if (
+                self.scheduler_config.async_scheduling
+                and self.parallel_config.data_parallel_size > 1
+            ):
+                # this is a corner case when both async scheduling
+                # and DP are enabled, num_scheduled_tokens could be
+                # 0, and has_unfinished_requests in the outer loop
+                # returns True. before returning early here we call
+                # dummy run to ensure coordinate_batch_across_dp
+                # is called into to avoid out of sync issues.
+                self._dummy_run(1)
             if not has_kv_transfer_group():
                 # Return empty ModelRunnerOutput if no work to do.
                 return EMPTY_MODEL_RUNNER_OUTPUT
```
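
The shape of the fix, as a self-contained sketch rather than the actual vLLM implementation: the helpers coordinate_across_dp and execute_step below are hypothetical, and only the dummy-batch-of-1 idea mirrors self._dummy_run(1) in the diff. A rank with zero scheduled tokens still enters the DP coordination collective before returning an empty result.

```python
import torch
import torch.distributed as dist


def coordinate_across_dp(num_tokens: int) -> None:
    # Stand-in for coordinate_batch_across_dp: a collective that every
    # DP rank must join once per step.
    t = torch.tensor([num_tokens], dtype=torch.int64)
    dist.all_reduce(t)


def execute_step(num_scheduled_tokens: int, async_scheduling: bool, dp_size: int):
    if not num_scheduled_tokens:
        if async_scheduling and dp_size > 1:
            # Nothing to run on this rank, but the other DP ranks still
            # expect it at the collective; joining with a dummy batch of
            # size 1 keeps everyone in lockstep (the role played by
            # self._dummy_run(1) in the real code).
            coordinate_across_dp(1)
        return None  # empty output, analogous to EMPTY_MODEL_RUNNER_OUTPUT
    coordinate_across_dp(num_scheduled_tokens)
    # ... the real forward pass would follow here ...
    return num_scheduled_tokens
```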
