Commit 25be145
fix a corner case that could cause out-of-sync with async scheduling and dp >1 (#28774)
Summary:
Pull Request resolved: #28774
We observed a comms issue when both async scheduling and DP>1 are enabled.
With async scheduling, num_scheduled_tokens can become zero, while DP>1 requires coordinate_batch_across_dp to be called, which performs an all_reduce across DP ranks. Returning early in the zero-token case causes the DP ranks to go out of sync.
Concretely, we got:
```
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
I1114 16:43:23 native_sampler.py:491 1199692:MainThread] rank=2: has_unfinished_requests
W1114 16:43:23.344583 1233050 ExceptionTracer.cpp:193] Invalid trace stack for exception of type: gloo::EnforceNotMet
terminate called after throwing an instance of 'gloo::EnforceNotMet'
what(): [enforce fail at /mnt/code/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:454] op.preamble.length <= op.nbytes. 8 vs 4
```
This happens when dp_rank 0 is calling has_unfinished_requests() while dp_rank 1 is still waiting at coordinate_batch_across_dp().
The logs indicate that in the previous step dp_rank 0 returned early with num_scheduled_tokens=0.
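The fix pattern can be sketched as follows. This is a minimal, self-contained illustration using hypothetical function names (not vLLM's actual API): under DP>1, every rank must enter the collective coordination step even when it has zero scheduled tokens, because a rank that returns early leaves its peers blocked inside the all_reduce.

```python
# Minimal sketch of the fix pattern (hypothetical names, not vLLM's
# actual implementation): always run the DP coordination collective
# before any early return for an empty batch.

def execute_model_step(rank, num_scheduled_tokens, dp_size, coordination_log):
    """Run one scheduling step; record which ranks join DP coordination."""
    if dp_size > 1:
        # Stands in for coordinate_batch_across_dp(), which performs an
        # all_reduce across DP ranks. Every rank must reach this point,
        # even with an empty batch, or the ranks desynchronize.
        coordination_log.append(rank)
    if num_scheduled_tokens == 0:
        # Returning early is safe only *after* the collective has run.
        return None
    return f"ran batch of {num_scheduled_tokens} tokens on rank {rank}"

# Simulate the problematic step: rank 0 has no work, rank 1 does.
log = []
results = [
    execute_model_step(r, tokens, dp_size=2, coordination_log=log)
    for r, tokens in [(0, 0), (1, 8)]
]
# Both ranks entered coordination, so the all_reduce would not hang.
assert sorted(log) == [0, 1]
```

The buggy version corresponds to checking `num_scheduled_tokens == 0` and returning before the coordination call, which is exactly the ordering this change reverses.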
Test Plan:
With the change, our workload runs for multiple hours without triggering this issue.
**before**
```
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 19:28:01 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
ERROR 11-14 19:28:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[], num_computed_tokens=[], num_output_tokens=[]), num_scheduled_tokens={}, total_num_scheduled_tokens=0, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], pending_structured_output_tokens=false, kv_connector_metadata=null)
```
The server crashes immediately after the first num_scheduled_tokens=0 case.
**after**
```
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:42:55 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 19:47:23 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
INFO 11-14 20:03:04 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:31:43 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:35:39 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=0, num_scheduled_tokens=0
INFO 11-14 20:48:08 [gpu_model_runner.py:2533] rank=1, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=2, num_scheduled_tokens=0
INFO 11-14 20:48:13 [gpu_model_runner.py:2533] rank=3, num_scheduled_tokens=0
```
Differential Revision: D871311861
1 file changed, 11 insertions(+), 0 deletions(-)