[CI] Split Distributed Tests (4 GPUs) into 3 parallel jobs #37015

Merged
DarkLight1337 merged 2 commits into vllm-project:main from khluu:ci/split-distributed-4gpu
Mar 14, 2026

Conversation

@khluu
Collaborator

@khluu khluu commented Mar 13, 2026

Summary

  • Split the single ~1h09m "Distributed Tests (4 GPUs)" job into three smaller jobs targeting ~20m each
  • Distributed Torchrun + Examples (4 GPUs) (~17m): torchrun tests + data_parallel + rlhf examples
  • Distributed DP Tests (4 GPUs) (~20m): DP pytest tests (async_llm_dp, eagle_dp, external/internal/hybrid_lb, engine_core_client, test_utils)
  • Distributed Compile + Comm (4 GPUs) (~25m): compile/fullgraph + pynccl + events + symm_mem + multiproc_executor

Test plan

  • Verify all three new jobs pass in CI
  • Confirm no tests are missing compared to the original single job
  • Check that source_file_dependencies are correctly split across jobs
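
The second test-plan item can be checked mechanically. The following is a minimal sketch, assuming each job is reduced to a list of short test labels; the labels are shorthand taken from the Summary, not the real pipeline commands (which this thread does not show), and `missing_and_duplicated` is a hypothetical helper.

```python
def missing_and_duplicated(original, split_jobs):
    """Return (missing, duplicated) test labels after splitting one job."""
    seen = {}
    for job_name, tests in split_jobs.items():
        for test in tests:
            seen.setdefault(test, []).append(job_name)
    # Tests the original job ran that no new job runs.
    missing = sorted(set(original) - set(seen))
    # Tests that ended up in more than one new job.
    duplicated = {t: jobs for t, jobs in seen.items() if len(jobs) > 1}
    return missing, duplicated


# Shorthand labels from the Summary (assumption: one label per test group).
split_jobs = {
    "Distributed Torchrun + Examples (4 GPUs)": [
        "torchrun", "data_parallel", "rlhf",
    ],
    "Distributed DP Tests (4 GPUs)": [
        "async_llm_dp", "eagle_dp", "external_lb", "internal_lb",
        "hybrid_lb", "engine_core_client", "test_utils",
    ],
    "Distributed Compile + Comm (4 GPUs)": [
        "compile/fullgraph", "pynccl", "events", "symm_mem",
        "multiproc_executor",
    ],
}

# Assumption: the original job ran exactly the union of these groups.
original = [t for tests in split_jobs.values() for t in tests]

missing, duplicated = missing_and_duplicated(original, split_jobs)
print(missing, duplicated)  # [] {}
```

In practice `original` would be read from the pre-split job definition rather than derived from the new jobs, so that a dropped test shows up in `missing`.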

AI assistance was used (Claude). This is not duplicating any existing PR.

🤖 Generated with Claude Code

Split the single ~1h09m "Distributed Tests (4 GPUs)" job into three
smaller jobs targeting ~20m each:

- Distributed Torchrun + Examples (4 GPUs) (~17m)
- Distributed DP Tests (4 GPUs) (~20m)
- Distributed Compile + Comm (4 GPUs) (~25m)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mergify mergify bot added the ci/build label Mar 13, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement for CI efficiency, splitting a large distributed test job into three smaller, parallel jobs. The overall structure of the split is logical. However, I've identified several issues with the source_file_dependencies in the new jobs. These include missing dependencies, incorrect file paths, and missing file extensions, which could prevent CI from running when relevant files are changed. Addressing these is critical for CI correctness.

Comment on lines 58 to 62
- vllm/distributed/
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
- tests/compile/fullgraph/test_basic_correctness.py
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- examples/offline_inference/new_weight_syncing/
- tests/examples/offline_inference/data_parallel.py

critical

The source_file_dependencies for this job have a few issues that could cause it to not run when expected:

  1. The job runs distributed/test_torchrun_example.py and distributed/test_torchrun_example_moe.py, but these are not listed as dependencies.
  2. The dependency tests/examples/offline_inference/data_parallel.py seems incorrect. The command runs ../examples/offline_inference/data_parallel.py, so the dependency should likely be on examples/offline_inference/data_parallel.py.

Please update the dependencies to ensure this job is triggered correctly.

  - vllm/distributed/
  - tests/distributed/test_torchrun_example.py
  - tests/distributed/test_torchrun_example_moe.py
  - examples/offline_inference/rlhf.py
  - examples/offline_inference/rlhf_colocate.py
  - examples/offline_inference/new_weight_syncing/
  - examples/offline_inference/data_parallel.py
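
The two points in this comment can be illustrated with a small coverage check. This is a sketch under stated assumptions: commands are taken to run from the `tests/` directory (suggested by the `../examples/...` path quoted above), and a dependency is taken to match a file when it is a prefix of the file's repository path. The thread confirms neither; `repo_path` and `uncovered` are hypothetical helpers.

```python
import posixpath


def repo_path(arg, workdir="tests"):
    """Resolve a command argument to a repo-relative path (assumed workdir)."""
    return posixpath.normpath(posixpath.join(workdir, arg))


def uncovered(run_paths, dependencies):
    """Files the job runs that no dependency entry covers (prefix match)."""
    return [p for p in run_paths
            if not any(p.startswith(dep) for dep in dependencies)]


# Files the review says the job runs.
run_args = [
    "distributed/test_torchrun_example.py",
    "distributed/test_torchrun_example_moe.py",
    "../examples/offline_inference/data_parallel.py",
]
run_paths = [repo_path(a) for a in run_args]

# Dependency list suggested in the review comment.
suggested = [
    "vllm/distributed/",
    "tests/distributed/test_torchrun_example.py",
    "tests/distributed/test_torchrun_example_moe.py",
    "examples/offline_inference/rlhf.py",
    "examples/offline_inference/rlhf_colocate.py",
    "examples/offline_inference/new_weight_syncing/",
    "examples/offline_inference/data_parallel.py",
]

# Same list, but with the path the review flagged as incorrect.
with_flagged_path = [
    "tests/examples/offline_inference/data_parallel.py"
    if dep == "examples/offline_inference/data_parallel.py" else dep
    for dep in suggested
]

print(uncovered(run_paths, suggested))          # []
print(uncovered(run_paths, with_flagged_path))  # ['examples/offline_inference/data_parallel.py']
```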

- vllm/distributed/
- tests/v1/distributed
- tests/v1/engine/test_engine_core_client.py
- tests/distributed/test_utils

critical

The dependency tests/distributed/test_utils is missing the .py extension. The test file is tests/distributed/test_utils.py. Please correct the path to ensure changes to this file trigger the job.

  - tests/distributed/test_utils.py

Comment on lines +115 to +116
- tests/distributed/test_pynccl
- tests/distributed/test_events

critical

The dependencies tests/distributed/test_pynccl and tests/distributed/test_events are missing the .py extension. The test files are tests/distributed/test_pynccl.py and tests/distributed/test_events.py respectively. Please correct the paths to ensure changes to these files trigger the job.

  - tests/distributed/test_pynccl.py
  - tests/distributed/test_events.py
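
Whether a missing `.py` extension actually prevents a trigger depends on how the pipeline compares dependency entries with changed files, which this thread does not state. The sketch below contrasts two hypothetical matching rules (`prefix_match` and `strict_match` are illustrative names, not functions from the CI tooling).

```python
def prefix_match(changed_file, dep):
    """Assumed rule A: a dependency matches any path that starts with it."""
    return changed_file.startswith(dep)


def strict_match(changed_file, dep):
    """Assumed rule B: exact file match, or directory prefix if dep ends in '/'."""
    if dep.endswith("/"):
        return changed_file.startswith(dep)
    return changed_file == dep


changed = "tests/distributed/test_pynccl.py"

# Under prefix matching the extension-less entry still triggers the job.
print(prefix_match(changed, "tests/distributed/test_pynccl"))     # True
# Under strict matching it does not, and the ".py" form is required.
print(strict_match(changed, "tests/distributed/test_pynccl"))     # False
print(strict_match(changed, "tests/distributed/test_pynccl.py"))  # True
```

Under either rule the entries with the `.py` extension match, so the suggested form is the safer one regardless of which rule applies.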

@khluu khluu added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 13, 2026
Add tests/distributed/test_torchrun_example.py and
test_torchrun_example_moe.py to source_file_dependencies so the
job triggers when those test files are edited.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@khluu
Collaborator Author

khluu commented Mar 14, 2026

Distributed Torchrun + examples: 18m
Distributed DP Tests: 32m
Distributed Compile + Comm: 32m

Total is 82 minutes, 13 minutes more than the original ~1h09m job, due to overhead
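
The arithmetic behind these figures, as a short sketch. The durations are the ones reported in this comment and the original ~1h09m comes from the PR summary; the wall-clock figure assumes all three jobs start at the same time on separate agents, which the thread does not confirm.

```python
# Measured durations from the comment above, in minutes.
durations_min = {
    "Distributed Torchrun + examples": 18,
    "Distributed DP Tests": 32,
    "Distributed Compile + Comm": 32,
}
original_min = 69  # ~1h09m, from the PR summary

total_min = sum(durations_min.values())     # aggregate job time
overhead_min = total_min - original_min     # extra time introduced by the split
wall_clock_min = max(durations_min.values())  # assumes fully parallel execution

print(total_min, overhead_min, wall_clock_min)  # 82 13 32
```

So the split costs more total job time, while the time until the slowest of the three jobs finishes is lower than the original single job under the parallel-start assumption.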

@DarkLight1337 DarkLight1337 merged commit 74fe80e into vllm-project:main Mar 14, 2026
22 checks passed
siewcapital pushed a commit to siewcapital/vllm that referenced this pull request Mar 15, 2026
…ect#37015)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Siew's Capital Jarvis <brayden.stanley.0127@gmail.com>
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 15, 2026
…ect#37015)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
athrael-soju pushed a commit to athrael-soju/vllm that referenced this pull request Mar 16, 2026
…ect#37015)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…ect#37015)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)

3 participants