[CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests #30443
simon-mo merged 5 commits into vllm-project:main
Code Review
This pull request refactors the Docker build process by removing the dedicated Dockerfile.nightly_torch and integrating its functionality into the main Dockerfile using a PYTORCH_NIGHTLY build argument. This is a good simplification. However, I've found a critical issue in the implementation of the nightly build logic within the main Dockerfile. The command for installing nightly PyTorch will likely fail due to version conflicts and an incorrect package index URL. I've provided a detailed comment with a suggested fix to address this.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
💡 Codex Review
https://github.com/vllm-project/vllm/blob/0f56bd5e5179ae96789877a0cf16fc032e601ea8/docker/Dockerfile#L284-L287
Nightly builds still pin release torch after repo copy
When building with PYTORCH_NIGHTLY=1, the base stage installs nightly torch and strips torch entries from the copied requirement files, but this COPY . . reintroduces the original pyproject.toml/requirements that pin torch == 2.9.0. Because use_existing_torch.py is not run after this copy, the wheel built in this stage still declares the release torch dependency, so downstream uv pip install dist/*.whl will reinstall/downgrade to the stable torch and the nightly flag has no effect. Re-run the cleanup after the full repo copy or avoid copying the pinned files when building nightlies.
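The concern above can be illustrated with a minimal sketch of the cleanup step. This is not the actual `use_existing_torch.py`; the function name `strip_torch_pins` and the `TORCH_PACKAGES` tuple are hypothetical, and the sketch assumes one requirement per line. It shows why the cleanup has to run on the files that are present when the wheel is built: a later copy of the original files brings the pin back.

```python
import re

# Hypothetical list of packages to strip; the real script may match differently.
TORCH_PACKAGES = ("torch", "torchvision", "torchaudio")

# Leading distribution name of a requirement line, e.g. "torch" in "torch==2.9.0".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def strip_torch_pins(lines):
    """Return the requirement lines with torch-family requirements removed."""
    kept = []
    for line in lines:
        match = _NAME.match(line)
        name = match.group(1).lower().replace("_", "-") if match else ""
        if name in TORCH_PACKAGES:
            continue  # drop the pin so the already-installed torch is used
        kept.append(line)
    return kept


original = ["torch==2.9.0", "numpy", "# torch is pinned above"]
cleaned = strip_torch_pins(original)   # pin removed in the early stage
recopied = list(original)              # a later full copy restores the pin
rebuilt = strip_torch_pins(recopied)   # re-running the cleanup removes it again
```

Comments and packages that merely start with `torch-` are left untouched because only exact name matches are dropped.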
@codex review
💡 Codex Review
https://github.com/vllm-project/vllm/blob/3418f6d59c06b6947d2a282707a8ad284b0e8d3f/docker/Dockerfile#L276-L278
Restore build.txt copy before wheel deps install
The wheel build stage now runs uv pip install ... -r requirements/build.txt, but this stage never copies that file anymore (base only brings in requirements/common.txt and requirements/cuda.txt, and no COPY precedes this RUN). As a result, docker build will hit a "requirements/build.txt: no such file or directory" failure when building the build stage, preventing image creation.
@codex review
Codex Review: Something went wrong. Try again later by commenting "@codex review".
Hi @orionr, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from cc6a0ed to 1ce9ed5.
Force-pushed from 1ce9ed5 to 7e9de52.
Code changes should be ready for review after the final Buildkite test runs. And now done.
@orionr What is the main motivation for this change? I'm worried it will negatively affect cache invalidation. For starters, can you add a dummy commit and see how long your rebuild takes and how it affects the cache hits?
@amrmahdi if you look at https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch, it has diverged heavily from the core Dockerfile and is no longer functional. It's challenging to keep the two in sync and to carry over the benefits added to the core Dockerfile. Generally, I think splitting these by hardware (Nvidia, AMD, CPU, etc.) makes sense, but splitting on library version less so. Also, this sets the foundation to simplify cross-repo releases between PyTorch and vLLM, where there is even more usage of torch.compile and other bits for performance. I have the same concerns around the caching, which is why I tagged you here and on the other PR. Let me start a chat with you about concerns here as well as on the ci-infra PR.
Edit: Kicked off a rebuild check at https://buildkite.com/vllm/ci/builds/45958/steps/canvas?sid=019b998b-6054-43ef-bab5-59e97e07fd6c. The previous build times on https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-4390-4709-bc4b-6f75c375c41a were build image (2h 37m) and build image torch nightly (4h 24m), so let's see.
Edit 2: Rebuild times show build image (24m 15s) and build image torch nightly (did a C++/CUDA rebuild, still going at 30+ minutes). Let's focus on build image, since we can optimize build image torch nightly later. This is good, but not down to the ~12-16 minutes that I think you were seeing @amrmahdi with your work? Did we regress something here?
Edit 3: With the addition of just a txt file we are at build image (11m 47s), which is much better.
cc @Harry-Chen
Heads up that we'd like to land this after I do some final rebuild and cache testing, which should be done by tomorrow. @amrmahdi and I had a good chat today and I'll also add some notes here about directions for standardization and improvements in the future.
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
After rebasing, build image took 21m and build image torch nightly took 7h 35m, both as expected and the same as before the rebase. Ultimately, the torch nightly build succeeded at https://buildkite.com/vllm/ci/builds/47099/steps/canvas?sid=019bbeab-8c6c-4773-9271-bf77f9e02737 and three downstream tests had failures, which we'll fix separately. Much better than the build failing. Also, I confirmed that the rebuild image, with a simple text file addition, took ~8m at https://buildkite.com/vllm/ci/builds/47229/steps/canvas?sid=019bc28f-ef90-49e4-a9d4-75467706a434. FYI to @amrmahdi @atalman. With all of that, we're good to go!
After we land this PR, here are some thoughts on next steps:
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
@khluu I just ran Running from a different location (MacBook Pro in this case) seems to work.
Head branch was pushed to by a user without write access
Documentation preview: https://vllm--30443.org.readthedocs.build/en/30443/
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com>
Force-pushed from 69fdad8 to f1851bb.
…h tests (vllm-project#30443) Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com> Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: 陈建华 <1647430658@qq.com>
…h tests (vllm-project#30443) Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com> Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.
Moving this from vllm-project/ci-infra#239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo
Tests to confirm:
- (vllm fork matching HEAD, no ci-infra changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. -> Seems like the PT nightlies build itself failed on installing flashinfer, so all tests failed afterwards.
- (vllm changes at [CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests #30443, my ci-infra changes at [torch nightlies] Use main Dockerfile with flags for nightly torch tests ci-infra#244) with a successful build at https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-43ce-46d3-99c2-c10a1a8ce96c. One downstream test is failing, but that looks real and something we will investigate.
We will remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch in a separate commit.
Looking for review and landing with help from @khluu, @amrmahdi, @atalman, @huydhn . Thanks!
After we land this PR, here are some thoughts on next steps:
Note
Moves PyTorch nightly support into the main docker/Dockerfile via a PYTORCH_NIGHTLY=1 flag and enforces consistent torch lib versions across all build/test stages.
- PYTORCH_NIGHTLY ARG to install nightly torch/torchvision/torchaudio and switches indexes accordingly
- use_existing_torch.py to strip torch deps from requirements/* and pyproject.toml (prefix-aware)
- torch_lib_versions.txt and validates them in csrc-build, build, vllm-base, and test stages
- docker/Dockerfile.nightly_torch as deprecated; use the main Dockerfile with PYTORCH_NIGHTLY=1
Written by Cursor Bugbot for commit f8f0e53f6e9e547196cd1cc11218cef56c07851f.
Note
Moves PyTorch nightly support into the main docker/Dockerfile behind PYTORCH_NIGHTLY=1, with consistent torch lib versioning across all stages.
- PYTORCH_NIGHTLY ARG and switches installs to nightly indexes when set; otherwise keeps release flow
- use_existing_torch.py to strip torch* deps from requirements/* and pyproject.toml (prefix-aware)
- torch/torchvision/torchaudio into torch_lib_versions.txt and reuses them in csrc-build, build, dev, vllm-base, and test stages to avoid version drift
- docker/Dockerfile.nightly_torch as deprecated in favor of the main Dockerfile + flag
Written by Cursor Bugbot for commit c00a0691687a77994bd613ce4e507f5541096c60.
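The version-drift check described in the summary could look roughly like the sketch below. It assumes that torch_lib_versions.txt holds one `name==version` entry per line, which the PR text does not confirm; the helper names `parse_pins` and `find_version_drift` and the version strings are hypothetical and only serve the illustration.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, version = line.partition("==")
        if not sep:
            raise ValueError(f"expected 'name==version', got: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def find_version_drift(pins, installed):
    """Map each drifted package to a (pinned, installed) pair; None means missing."""
    drift = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        if found != wanted:
            drift[name] = (wanted, found)
    return drift


# Made-up versions, for illustration only.
pins = parse_pins("torch==2.10.0.dev1\ntorchvision==0.25.0.dev1\n")
drift = find_version_drift(pins, {"torch": "2.10.0.dev1", "torchvision": "0.24.0"})
```

In a build stage, the `installed` mapping could be filled from `importlib.metadata.version(name)` for each pinned package, and a non-empty `drift` would fail the stage.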