
Conversation

@bangshengtang (Collaborator)

Summary:
We observed a comms issue when both async scheduling and DP>1 are enabled.

When async scheduling is enabled, num_scheduled_tokens can become zero, while DP>1 requires coordinate_batch_across_dp to be called, which performs an all_reduce across DP ranks. Returning early in that case causes the DP ranks to go out of sync.

Basically we got:

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
I1114 16:43:23 native_sampler.py:491 1199692:MainThread] rank=2: has_unfinished_requests
W1114 16:43:23.344583 1233050 ExceptionTracer.cpp:193] Invalid trace stack for exception of type: gloo::EnforceNotMet
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
```

This happens when dp_rank 0 is calling into has_unfinished_requests() while dp_rank 1 is waiting at coordinate_batch_across_dp().

The logs indicate that in the previous step dp_rank 0 had returned early with num_scheduled_tokens=0.
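
For illustration, here is a minimal, self-contained sketch of the intended control flow. Only coordinate_batch_across_dp, execute_dummy_batch, and the zero-token condition come from this thread; every class name, attribute, and return value below is a simplified stand-in, not the actual gpu_model_runner code.

```python
# Illustrative sketch only -- simplified stand-ins, not the real vLLM model runner.
from dataclasses import dataclass


@dataclass
class SchedulerOutput:
    total_num_scheduled_tokens: int


class ModelRunnerSketch:
    def __init__(self, dp_size: int):
        self.dp_size = dp_size

    def coordinate_batch_across_dp(self, num_tokens: int) -> None:
        # In the real runner this ends up in an all_reduce across DP ranks;
        # if one rank skips it, the others block (or crash) inside the collective.
        print(f"participating in DP all_reduce with num_tokens={num_tokens}")

    def execute_dummy_batch(self) -> None:
        # A padded no-op pass whose only job is to keep the collectives aligned.
        self.coordinate_batch_across_dp(num_tokens=0)

    def execute_model(self, scheduler_output: SchedulerOutput):
        if scheduler_output.total_num_scheduled_tokens == 0:
            if self.dp_size > 1:
                # Previously this path returned before any DP coordination,
                # leaving the other ranks stuck in coordinate_batch_across_dp().
                self.execute_dummy_batch()
            return None  # nothing to sample on this rank
        self.coordinate_batch_across_dp(scheduler_output.total_num_scheduled_tokens)
        # ... the real forward pass and sampling would follow here ...
        return "model output"


if __name__ == "__main__":
    runner = ModelRunnerSketch(dp_size=2)
    runner.execute_model(SchedulerOutput(total_num_scheduled_tokens=0))
```

The point is only the ordering: a rank with zero scheduled tokens still reaches the same collective that the other DP ranks execute in that step.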

Test Plan:
With the change, our workload runs for multiple hours without triggering this issue.

**before**

```
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
```

The server crashes immediately after the first num_scheduled_tokens=0 case.

**after**

```
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
```

Differential Revision: D87131186

@mergify mergify bot added the v1 label Nov 15, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a critical synchronization issue that can occur when using asynchronous scheduling with data parallelism. The fix in gpu_model_runner.py correctly introduces a dummy run to ensure all data-parallel ranks stay in sync, even when no tokens are scheduled, preventing potential deadlocks. The changes in rocm_aiter_fa.py, while not explicitly described, appear to be a defensive measure to ensure min_seqlen_q is always at least 1, which is a reasonable safeguard. Overall, the changes are well-targeted and improve the robustness of the system in distributed environments.
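
For reference, a sketch of the kind of min_seqlen_q safeguard described above; only the clamp itself comes from the comment, and where it would sit in rocm_aiter_fa.py is an assumption rather than the actual diff:

```python
def clamp_min_seqlen_q(min_seqlen_q: int) -> int:
    # Illustrative only: a dummy/empty batch can yield min_seqlen_q == 0,
    # which the attention kernel may not accept, so floor it at 1.
    return max(min_seqlen_q, 1)
```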

bangshengtang added a commit to bangshengtang/vllm that referenced this pull request Nov 15, 2025
…and dp >1 (vllm-project#28774)

@bangshengtang (Collaborator, Author)

cc: @njhill @WoosukKwon

@22quinn 22quinn requested review from WoosukKwon and njhill November 15, 2025 07:38
@njhill njhill mentioned this pull request Nov 15, 2025
@houseroad (Collaborator) left a comment


Looks good.

@houseroad houseroad added the ready-for-merge and ready labels Nov 17, 2025
@njhill njhill changed the title fix a corner case that could cause out-of-sync with async scheduling and dp >1 [BugFix] Corner case that could cause out-of-sync with async scheduling and dp >1 Nov 17, 2025
@njhill (Member) commented Nov 17, 2025

Thanks @bangshengtang, reviewing this now...

@njhill (Member) left a comment


@bangshengtang do you have a reproducer for this? We do have CI coverage for DP which seems to work fine with async scheduling enabled. Is this a race condition that's hard to trigger?

I'm trying to understand the condition in which this transpires, since I think it should be covered here:

vllm/vllm/v1/engine/core.py

Lines 1226 to 1237 in e42bd8c

```python
executed = self._process_engine_step()
self._maybe_publish_request_counts()
local_unfinished_reqs = self.scheduler.has_unfinished_requests()
if not executed:
    if not local_unfinished_reqs and not self.engines_running:
        # All engines are idle.
        continue
    # We are in a running state and so must execute a dummy pass
    # if the model didn't execute any ready requests.
    self.execute_dummy_batch()
```

I'd like to see whether there's a fix which can be made in the engine core loop orchestration since that's what currently controls the dummy runs or else it's split between multiple places.

FWIW though, with the upcoming V2 model runner refactor I think we will be moving this into the runner.

bangshengtang added a commit to bangshengtang/vllm that referenced this pull request Nov 17, 2025
…and dp >1 (vllm-project#28774)

@njhill (Member) left a comment


Thanks @bangshengtang. Per Slack discussion we can include this for now as a fix for external launcher + DP.

You need to sign off your commit for the DCO though: https://github.com/vllm-project/vllm/pull/28774/checks?check_run_id=55627622958

@njhill njhill changed the title [BugFix] Corner case that could cause out-of-sync with async scheduling and dp >1 [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp >1 Nov 17, 2025
@zhuohan123 zhuohan123 enabled auto-merge (squash) November 17, 2025 19:48
@zhuohan123 zhuohan123 disabled auto-merge November 17, 2025 23:22
@zhuohan123 zhuohan123 merged commit 6148584 into vllm-project:main Nov 17, 2025
43 of 45 checks passed
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025