[CI/Testing] Add basic single node dual batch overlap test#27235
[CI/Testing] Add basic single node dual batch overlap test#27235tlrmchlsmth merged 4 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new test for Dual Batch Overlap (DBO) with Data Parallelism and Expert Parallelism. The test is well-structured, using a GSM8K evaluation to verify correctness on a multi-GPU single-node setup. The CI configuration is also updated to run this test on H100 GPUs. My review found one high-severity issue related to missing test dependencies in the CI configuration, which could lead to the test not running when its helper utilities are modified. Otherwise, the changes are solid and a good addition to the test suite.
.buildkite/test-pipeline.yaml
Outdated
| source_file_dependencies: | ||
| - docker/Dockerfile # To catch DeepEP updates | ||
| - vllm/model_executor/layers/fused_moe | ||
| - vllm/distributed/device_communicators | ||
| - vllm/v1/worker/ | ||
| - vllm/v1/attention/backends/utils.py |
There was a problem hiding this comment.
The source_file_dependencies list is missing dependencies on the test utility files used by tests/v1/distributed/test_dbo.py. The test imports from tests.evals.gsm8k.gsm8k_eval and tests.utils. Changes to these files could affect the test's behavior or correctness, but they won't trigger this test run. Please add them to the dependency list to ensure the test is run when its dependencies change.
source_file_dependencies:
- tests/evals/gsm8k/gsm8k_eval.py
- tests/utils.py
- docker/Dockerfile # To catch DeepEP updates
- vllm/model_executor/layers/fused_moe
- vllm/distributed/device_communicators
- vllm/v1/worker/
- vllm/v1/attention/backends/utils.pyThere was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
.buildkite/test-pipeline.yaml
Outdated
| - label: Distributed Tests (H100) # optional | ||
| gpu: h100 | ||
| working_dir: "/vllm-workspace/" | ||
| num_gpus: 2 | ||
| commands: | ||
| - pytest -v -s tests/v1/distributed/test_dbo.py | ||
| source_file_dependencies: |
There was a problem hiding this comment.
The new H100 pipeline step is commented as optional but the Buildkite block doesn’t set optional: true. Without that flag Buildkite will treat the step as required, so every CI run now waits for an H100 agent even when the queue has none available. This effectively blocks the pipeline whenever H100 hardware isn’t provisioned, defeating the stated intent of having an optional dual batch overlap test.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Seems like a reasonable suggestion
There was a problem hiding this comment.
Moved it to B200 and H200 nightly 👍 (per suggestion from @mgoin)
SageMoore
left a comment
There was a problem hiding this comment.
Thanks for the test @LucasWilkinson
| # Note: Not using --enforce-eager to test DBO's alternate CUDA graph dispatching | ||
| "--data-parallel-size", str(DP_SIZE), | ||
| "--enable-expert-parallel", | ||
| "--enable-dbo", |
There was a problem hiding this comment.
Do we want to drop the decode threshold as well?
There was a problem hiding this comment.
we could; I already verified that we hit cases above and below both thresholds but probably good to fix them so if they get updated we don't suddenly start testing no-DBO
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
53c444d to
d498913
Compare
…ect#27235) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…ect#27235) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Ideally we'd do a multi-node test; but add a single node test for now to make sure we atleast get some coverage
Not sure exactly which test suite to put it in; uses DeepEP so needs to be run on hopper or Blackwell