
Commit 25be145

bangshengtang authored and facebook-github-bot committed
fix a corner case that could cause out-of-sync with async scheduling and dp >1 (#28774)
Summary:
Pull Request resolved: #28774

We observed a comms issue when both async scheduling and DP > 1 are enabled. When async scheduling is enabled, num_scheduled_tokens can become zero, while DP > 1 requires coordinate_batch_across_dp to be called into, where an all_reduce across DP ranks is required. Returning early in that case causes the DP ranks to go out of sync.

Concretely, we hit:

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
I1114 16:43:23 native_sampler.py:491 1199692:MainThread] rank=2: has_unfinished_requests
W1114 16:43:23.344583 1233050 ExceptionTracer.cpp:193] Invalid trace stack for exception of type: gloo::EnforceNotMet
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
```

This happens when dp_rank 0 is calling into has_unfinished_requests() while dp_rank 1 is waiting at coordinate_batch_across_dp(), and the logs indicate that in the previous step dp_rank 0 returned early with num_scheduled_tokens=0.

Test Plan: with the change, our workload runs for multiple hours without triggering this issue.

**before**

```
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
```

The server crashes immediately after the first num_scheduled_tokens=0 case.

**after**

```
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
```

Differential Revision: D87131186
1 parent 681bd93 commit 25be145
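
To make the lockstep requirement from the summary concrete, here is a minimal standalone sketch, not vLLM code: the function name coordinate_step and the token counts are made up for illustration. Every DP rank has to join the per-step all_reduce even when it has nothing scheduled; a rank that returns early instead leaves its peers blocked at the collective, or mismatched in message size as in the Gloo failure above.

```python
import torch
import torch.distributed as dist


def coordinate_step(num_scheduled_tokens: int) -> int:
    # Every DP rank must enter this collective once per engine step, even
    # with zero scheduled tokens. If one rank skips it, the other ranks
    # block here or read a payload of the wrong size (the Gloo
    # "op.preamble.length <= op.nbytes" failure seen in the summary).
    t = torch.tensor([num_scheduled_tokens], dtype=torch.int64)
    dist.all_reduce(t)  # sums scheduled tokens across DP ranks
    return int(t.item())


if __name__ == "__main__":
    # Run with, e.g.: torchrun --nproc_per_node=2 this_script.py
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    # Rank 0 has nothing scheduled this step but still joins the collective.
    total = coordinate_step(0 if rank == 0 else 4)
    print(f"rank={rank} total_scheduled_across_dp={total}")
    dist.destroy_process_group()
```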

1 file changed: +11 additions, 0 deletions


vllm/v1/worker/gpu_model_runner.py

Lines changed: 11 additions & 0 deletions
```diff
@@ -2534,6 +2534,17 @@ def execute_model(
             return make_empty_encoder_model_runner_output(scheduler_output)
 
         if not num_scheduled_tokens:
+            if (
+                self.scheduler_config.async_scheduling
+                and self.parallel_config.data_parallel_size > 1
+            ):
+                # this is a corner case when both async scheduling
+                # and DP are enabled, num_scheduled_tokens could be
+                # 0, and has_unfinished_requests in the outer loop
+                # returns True. before returning early here we call
+                # dummy run to ensure coordinate_batch_across_dp
+                # is called into to avoid out of sync issues.
+                self._dummy_run(1)
             if not has_kv_transfer_group():
                 # Return empty ModelRunnerOutput if no work to do.
                 return EMPTY_MODEL_RUNNER_OUTPUT
```
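
The shape of the fix, as a self-contained sketch rather than the actual vLLM implementation: the helpers coordinate_across_dp and execute_step below are hypothetical, and only the dummy-batch-of-1 idea mirrors self._dummy_run(1) in the diff. A rank with zero scheduled tokens still enters the DP coordination collective before returning an empty result.

```python
import torch
import torch.distributed as dist


def coordinate_across_dp(num_tokens: int) -> None:
    # Stand-in for coordinate_batch_across_dp: a collective that every
    # DP rank must join once per step.
    t = torch.tensor([num_tokens], dtype=torch.int64)
    dist.all_reduce(t)


def execute_step(num_scheduled_tokens: int, async_scheduling: bool, dp_size: int):
    if not num_scheduled_tokens:
        if async_scheduling and dp_size > 1:
            # Nothing to run on this rank, but the other DP ranks still
            # expect it at the collective; joining with a dummy batch of
            # size 1 keeps everyone in lockstep (the role played by
            # self._dummy_run(1) in the real code).
            coordinate_across_dp(1)
        return None  # empty output, analogous to EMPTY_MODEL_RUNNER_OUTPUT
    coordinate_across_dp(num_scheduled_tokens)
    # ... the real forward pass would follow here ...
    return num_scheduled_tokens
```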
