[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs #37015
DarkLight1337 merged 2 commits into vllm-project:main
Split the single ~1h09m "Distributed Tests (4 GPUs)" job into three smaller jobs targeting ~20m each:
- Distributed Torchrun + Examples (4 GPUs) (~17m)
- Distributed DP Tests (4 GPUs) (~20m)
- Distributed Compile + Comm (4 GPUs) (~25m)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request is a great improvement for CI efficiency, splitting a large distributed test job into three smaller, parallel jobs. The overall structure of the split is logical. However, I've identified several issues with the source_file_dependencies in the new jobs. These include missing dependencies, incorrect file paths, and missing file extensions, which could prevent CI from running when relevant files are changed. Addressing these is critical for CI correctness.
- vllm/distributed/
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
- tests/compile/fullgraph/test_basic_correctness.py
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- examples/offline_inference/new_weight_syncing/
- tests/examples/offline_inference/data_parallel.py
The source_file_dependencies for this job have a few issues that could cause it to not run when expected:
- The job runs distributed/test_torchrun_example.py and distributed/test_torchrun_example_moe.py, but these are not listed as dependencies.
- The dependency tests/examples/offline_inference/data_parallel.py seems incorrect. The command runs ../examples/offline_inference/data_parallel.py, so the dependency should likely be on examples/offline_inference/data_parallel.py.

Please update the dependencies to ensure this job is triggered correctly.
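To make the two issues above concrete, here is a minimal sketch of a trigger check. It assumes a changed file triggers a job when its path starts with one of the listed entries; that matching rule and the helper name `job_is_triggered` are assumptions for illustration, not the actual pipeline generator logic. The `listed` entries are a subset of the dependencies under review.

```python
from typing import Iterable


def job_is_triggered(changed_files: Iterable[str], dependencies: Iterable[str]) -> bool:
    """Return True if any changed file path starts with any dependency entry.

    ASSUMPTION: prefix matching. The real CI matcher may behave differently.
    """
    deps = list(dependencies)
    return any(path.startswith(dep) for path in changed_files for dep in deps)


# Subset of the entries flagged in the review comment.
listed = [
    "vllm/distributed/",
    "tests/examples/offline_inference/data_parallel.py",
]

# Entries proposed in the suggested change.
suggested = [
    "vllm/distributed/",
    "tests/distributed/test_torchrun_example.py",
    "tests/distributed/test_torchrun_example_moe.py",
    "examples/offline_inference/data_parallel.py",
]

# Files the job actually runs, according to the review comment.
example_change = ["examples/offline_inference/data_parallel.py"]
test_change = ["tests/distributed/test_torchrun_example.py"]

example_before = job_is_triggered(example_change, listed)
example_after = job_is_triggered(example_change, suggested)
test_before = job_is_triggered(test_change, listed)
test_after = job_is_triggered(test_change, suggested)
```

Under this assumption, neither change triggers the job with the listed entries, and both do with the suggested ones.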
- vllm/distributed/
- tests/distributed/test_torchrun_example.py
- tests/distributed/test_torchrun_example_moe.py
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- examples/offline_inference/new_weight_syncing/
- examples/offline_inference/data_parallel.py

- vllm/distributed/
- tests/v1/distributed
- tests/v1/engine/test_engine_core_client.py
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
The dependencies tests/distributed/test_pynccl and tests/distributed/test_events are missing the .py extension. The test files are tests/distributed/test_pynccl.py and tests/distributed/test_events.py respectively. Please correct the paths to ensure changes to these files trigger the job.
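A check like the one described here could be automated. The sketch below flags dependency entries that resolve to neither an existing file nor a directory in a repository file listing. The helper name `find_unresolved_dependencies` and the file listing are hypothetical stand-ins (in practice the listing might come from `git ls-files`). It only verifies that entries point at real paths and says nothing about how the CI matches them.

```python
from typing import Iterable, List, Set


def find_unresolved_dependencies(dependencies: Iterable[str], repo_files: Set[str]) -> List[str]:
    """Return entries that are neither a tracked file nor a directory prefix."""
    unresolved = []
    for dep in dependencies:
        is_file = dep in repo_files
        # Treat the entry as a directory, with or without a trailing slash.
        dir_prefix = dep if dep.endswith("/") else dep + "/"
        is_dir = any(path.startswith(dir_prefix) for path in repo_files)
        if not (is_file or is_dir):
            unresolved.append(dep)
    return unresolved


# Illustrative listing only; not the real repository contents.
repo_files = {
    "vllm/distributed/some_module.py",
    "tests/v1/distributed/some_test.py",
    "tests/distributed/test_pynccl.py",
    "tests/distributed/test_events.py",
}

dependencies = [
    "vllm/distributed/",
    "tests/v1/distributed",
    "tests/distributed/test_pynccl",
    "tests/distributed/test_events",
]

unresolved = find_unresolved_dependencies(dependencies, repo_files)
```

With this listing, the two extensionless entries are reported as unresolved, while the directory entries pass with or without a trailing slash.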
- tests/distributed/test_pynccl.py
- tests/distributed/test_events.py

Add tests/distributed/test_torchrun_example.py and test_torchrun_example_moe.py to source_file_dependencies so the job triggers when those test files are edited.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Distributed Torchrun + examples: 18m
Total is 82 minutes, 13 minutes more than the original job due to overhead.
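The trade-off in these numbers can be worked through in a few lines. This sketch uses the ~1h09m original duration and the 82-minute total quoted in this PR, plus the per-job estimates from the description (not measured values). The wall-clock figure assumes all three jobs start at the same time on separate 4-GPU agents, which is an assumption about agent availability.

```python
# Durations in minutes, taken from the PR text.
original_job = 69          # the single ~1h09m job
total_after_split = 82     # summed duration of the three jobs, as reported

# Extra aggregate machine time introduced by the split.
overhead = total_after_split - original_job

# Per-job estimates from the PR description (targets, not measurements).
estimated = {
    "Distributed Torchrun + Examples (4 GPUs)": 17,
    "Distributed DP Tests (4 GPUs)": 20,
    "Distributed Compile + Comm (4 GPUs)": 25,
}

# ASSUMPTION: the jobs run fully in parallel, so the longest one sets the wall-clock time.
estimated_wall_clock = max(estimated.values())

print(f"overhead: {overhead} min, estimated wall clock: {estimated_wall_clock} min")
```

The split spends more total machine time in exchange for a shorter wait on the longest job.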
AI assistance was used (Claude). This is not duplicating any existing PR.
🤖 Generated with Claude Code