
Conversation

@bangshengtang (Collaborator)

Summary:
We observed a comms issue when both async scheduling and DP>1 are enabled.

When async scheduling is enabled, num_scheduled_tokens can become zero, while DP>1 requires coordinate_batch_across_dp to be called, which performs an all_reduce across DP ranks. Returning early in that case causes the DP ranks to go out of sync.

Basically we got:

```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
I1114 16:43:23 native_sampler.py:491 1199692:MainThread] rank=2: has_unfinished_requests
W1114 16:43:23.344583 1233050 ExceptionTracer.cpp:193] Invalid trace stack for exception of type: gloo::EnforceNotMet
terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
```

This happens when dp_rank 0 is calling into has_unfinished_requests() while dp_rank 1 is waiting at coordinate_batch_across_dp().

The logs indicate that in the previous step dp_rank 0 had returned early with num_scheduled_tokens=0.
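
For illustration, here is a minimal, self-contained sketch of the intended control flow. Only coordinate_batch_across_dp, execute_dummy_batch, and the zero-token condition come from this thread; every class name, attribute, and return value below is a simplified stand-in, not the actual gpu_model_runner code.

```python
# Illustrative sketch only -- simplified stand-ins, not the real vLLM model runner.
from dataclasses import dataclass


@dataclass
class SchedulerOutput:
    total_num_scheduled_tokens: int


class ModelRunnerSketch:
    def __init__(self, dp_size: int):
        self.dp_size = dp_size

    def coordinate_batch_across_dp(self, num_tokens: int) -> None:
        # In the real runner this ends up in an all_reduce across DP ranks;
        # if one rank skips it, the others block (or crash) inside the collective.
        print(f"participating in DP all_reduce with num_tokens={num_tokens}")

    def execute_dummy_batch(self) -> None:
        # A padded no-op pass whose only job is to keep the collectives aligned.
        self.coordinate_batch_across_dp(num_tokens=0)

    def execute_model(self, scheduler_output: SchedulerOutput):
        if scheduler_output.total_num_scheduled_tokens == 0:
            if self.dp_size > 1:
                # Previously this path returned before any DP coordination,
                # leaving the other ranks stuck in coordinate_batch_across_dp().
                self.execute_dummy_batch()
            return None  # nothing to sample on this rank
        self.coordinate_batch_across_dp(scheduler_output.total_num_scheduled_tokens)
        # ... the real forward pass and sampling would follow here ...
        return "model output"


if __name__ == "__main__":
    runner = ModelRunnerSketch(dp_size=2)
    runner.execute_model(SchedulerOutput(total_num_scheduled_tokens=0))
```

The point is only the ordering: a rank with zero scheduled tokens still reaches the same collective that the other DP ranks execute in that step.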

Test Plan:
With the change, our workload runs for multiple hours without triggering this issue.

**before**

```
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
```

The server crashes immediately after the first num_scheduled_tokens=0 case.

**after**

```
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
```

Differential Revision: D87131186

@mergify mergify bot added the v1 label Nov 15, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request addresses a critical synchronization issue that can occur when using asynchronous scheduling with data parallelism. The fix in gpu_model_runner.py correctly introduces a dummy run to ensure all data-parallel ranks stay in sync, even when no tokens are scheduled, preventing potential deadlocks. The changes in rocm_aiter_fa.py, while not explicitly described, appear to be a defensive measure to ensure min_seqlen_q is always at least 1, which is a reasonable safeguard. Overall, the changes are well-targeted and improve the robustness of the system in distributed environments.
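
For reference, a sketch of the kind of min_seqlen_q safeguard described above; only the clamp itself comes from the comment, and where it would sit in rocm_aiter_fa.py is an assumption rather than the actual diff:

```python
def clamp_min_seqlen_q(min_seqlen_q: int) -> int:
    # Illustrative only: a dummy/empty batch can yield min_seqlen_q == 0,
    # which the attention kernel may not accept, so floor it at 1.
    return max(min_seqlen_q, 1)
```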

bangshengtang added a commit to bangshengtang/vllm that referenced this pull request Nov 15, 2025
…and dp >1 (vllm-project#28774)

@bangshengtang (Collaborator, Author)

cc: @njhill @WoosukKwon

@22quinn 22quinn requested review from WoosukKwon and njhill November 15, 2025 07:38
@njhill njhill mentioned this pull request Nov 15, 2025
@houseroad (Collaborator) left a comment


Looks good.

@houseroad houseroad added the ready-for-merge and ready labels Nov 17, 2025
@njhill njhill changed the title fix a corner case that could cause out-of-sync with async scheduling and dp >1 [BugFix] Corner case that could cause out-of-sync with async scheduling and dp >1 Nov 17, 2025
@njhill (Member) commented Nov 17, 2025

Thanks @bangshengtang, reviewing this now...

@njhill (Member) left a comment


@bangshengtang do you have a reproducer for this? We do have CI coverage for DP which seems to work fine with async scheduling enabled. Is this a race condition that's hard to trigger?

I'm trying to understand the condition in which this transpires, since I think it should be covered here:

vllm/vllm/v1/engine/core.py

Lines 1226 to 1237 in e42bd8c

```python
executed = self._process_engine_step()
self._maybe_publish_request_counts()
local_unfinished_reqs = self.scheduler.has_unfinished_requests()
if not executed:
    if not local_unfinished_reqs and not self.engines_running:
        # All engines are idle.
        continue
    # We are in a running state and so must execute a dummy pass
    # if the model didn't execute any ready requests.
    self.execute_dummy_batch()
```

I'd like to see whether there's a fix which can be made in the engine core loop orchestration since that's what currently controls the dummy runs or else it's split between multiple places.

FWIW though, with the upcoming V2 model runner refactor I think we will be moving this into the runner.

bangshengtang added a commit to bangshengtang/vllm that referenced this pull request Nov 17, 2025
…and dp >1 (vllm-project#28774)

@njhill (Member) left a comment


Thanks @bangshengtang. Per Slack discussion we can include this for now as a fix for external launcher + DP.

You need to sign off your commit for the DCO though: https://github.com/vllm-project/vllm/pull/28774/checks?check_run_id=55627622958

@njhill njhill changed the title [BugFix] Corner case that could cause out-of-sync with async scheduling and dp >1 [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp >1 Nov 17, 2025
@zhuohan123 zhuohan123 enabled auto-merge (squash) November 17, 2025 19:48
@zhuohan123 zhuohan123 disabled auto-merge November 17, 2025 23:22
@zhuohan123 zhuohan123 merged commit 6148584 into vllm-project:main Nov 17, 2025
43 of 45 checks passed
Victor49152 pushed a commit to Victor49152/vllm that referenced this pull request Nov 20, 2025
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025