[CI][torch nightlies] Use main Dockerfile with flags for nightly torch tests #30443

Merged
simon-mo merged 5 commits into vllm-project:main from orionr:orionr/pt-nightlies on Jan 23, 2026

@orionr
Contributor

@orionr orionr commented Dec 11, 2025

Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.

Moving this from vllm-project/ci-infra#239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo

Tests to confirm:

  1. Baseline (my vllm fork matching HEAD, no ci-infra changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. Allowed 5 test runs to move forward. Result: the PT nightlies build itself appears to have failed while installing flashinfer, so all subsequent tests failed.
  2. New (my vllm changes in #30443, my ci-infra changes in ci-infra#244) with a successful build at https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-43ce-46d3-99c2-c10a1a8ce96c. One downstream test is failing, but the failure looks real and is something we will investigate.

We will remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch in a separate commit.

Looking for review and landing with help from @khluu, @amrmahdi, @atalman, @huydhn. Thanks!

After we land this PR, here are some thoughts on next steps:

  1. #32721 ([WIP][CI][build] Move torch deps into requirements/torch.txt and torchlib.txt): We should explore moving all torch dep requirements into requirements/torch.txt (torch==...) and requirements/torchlibs.txt (torchaudio==..., torchtext==...), so that cpu.txt, xpu.txt, test.in, etc. would all -r that file. If we do that, the use_existing_torch.py file would just need to clear or override that file to make sure we are referencing nightlies (update to torch_lib_versions.txt) or using the existing torch (make it a blank file). I might try and put up a separate PR around this with my ideas.
  2. There would still be some pip install flags adjusted for torch nightly vs release (--pre, --index-url vs --extra-index-url), but we could maybe get rid of the if [ "${PYTORCH_NIGHTLY}" = "1" ]; then checks in most cases, especially outside of the two base images. This would be good! I don't think we would need to use --constraint if the requirements/ files are updated correctly, but we should check.
  3. We should look at caching torch nightly builds the same way as standard builds, but be careful about cache location and collisions with the torch release builds.
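
To make item 1 concrete, here is a rough sketch of the "clear or override" idea. It is not the real use_existing_torch.py: the function name is hypothetical, and the assumed format of torch_lib_versions.txt (one name==version pin per line) is a guess. The function takes the recorded pins as text and returns the new contents for the proposed requirements/torch.txt.

```python
from typing import Optional


def override_torch_requirements(recorded_pins: Optional[str]) -> str:
    """Return new contents for the proposed requirements/torch.txt.

    recorded_pins=None -> blank file, so the already-installed torch is kept.
    recorded_pins=text -> replace release pins with the recorded nightly pins.

    Assumption: recorded_pins holds one "name==version" pin per line;
    blank lines and "#" comments are ignored.
    """
    if recorded_pins is None:
        return ""
    pins = [
        line.strip()
        for line in recorded_pins.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(pins) + ("\n" if pins else "")
```

With a single file to rewrite, every other requirements file that includes it via -r would pick up the change without being edited itself.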

Note

Moves PyTorch nightly support into the main docker/Dockerfile via a PYTORCH_NIGHTLY=1 flag and enforces consistent torch lib versions across all build/test stages.

  • Adds PYTORCH_NIGHTLY ARG to install nightly torch/torchvision/torchaudio and switches indexes accordingly
  • Introduces use_existing_torch.py to strip torch deps from requirements/* and pyproject.toml (prefix-aware)
  • Records installed torch lib versions in torch_lib_versions.txt and validates them in csrc-build, build, vllm-base, and test stages
  • Updates dev/test installs to pin/compile requirements against recorded torch versions when using nightlies
  • Marks docker/Dockerfile.nightly_torch as deprecated; use the main Dockerfile with PYTORCH_NIGHTLY=1

Written by Cursor Bugbot for commit f8f0e53f6e9e547196cd1cc11218cef56c07851f. This will update automatically on new commits. Configure here.
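
As an illustration of the "strip torch deps" step in the summary above, here is a minimal sketch. It is not the actual use_existing_torch.py, and how the real script interprets "prefix-aware" is not shown here. This version assumes requirements text with one requirement per line, and drops only lines whose package name is exactly torch, torchvision, or torchaudio; the function name is hypothetical.

```python
import re

# Assumption: only these three libraries are stripped (the ones named in the summary).
TORCH_LIBS = {"torch", "torchvision", "torchaudio"}

# Captures the leading package name of a requirement line, e.g. "torch" in "torch == 2.9.0".
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def strip_torch_requirements(text: str) -> str:
    """Return the requirements text without torch/torchvision/torchaudio lines."""
    kept = []
    for line in text.splitlines():
        match = _NAME_RE.match(line)
        name = match.group(1).lower() if match else None
        if name in TORCH_LIBS:
            continue  # drop the pin so the installed (nightly) torch is not replaced
        kept.append(line)  # comments, "-r" includes, and other packages pass through
    return "\n".join(kept) + ("\n" if kept else "")
```

Matching on the full package name rather than a bare "torch" prefix keeps unrelated packages whose names merely start with "torch" in the file.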


Note

Moves PyTorch nightly support into the main docker/Dockerfile behind PYTORCH_NIGHTLY=1, with consistent torch lib versioning across all stages.

  • Adds PYTORCH_NIGHTLY ARG and switches installs to nightly indexes when set; otherwise keeps release flow
  • Introduces use_existing_torch.py to strip torch* deps from requirements/* and pyproject.toml (prefix-aware)
  • Records installed torch/torchvision/torchaudio into torch_lib_versions.txt and reuses them in csrc-build, build, dev, vllm-base, and test stages to avoid version drift
  • Dev/test installs now compile/pin requirements against recorded torch versions when using nightlies
  • Marks docker/Dockerfile.nightly_torch as deprecated in favor of the main Dockerfile + flag

Written by Cursor Bugbot for commit c00a0691687a77994bd613ce4e507f5541096c60. This will update automatically on new commits. Configure here.
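
The record-then-validate idea behind torch_lib_versions.txt can be sketched as follows. This is an illustration under assumptions, not the Dockerfile's actual logic: the file format (name==version per line) and all function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def format_versions(installed: Dict[str, str]) -> str:
    """Serialize installed versions as 'name==version' lines (assumed file format)."""
    return "".join(f"{name}=={ver}\n" for name, ver in sorted(installed.items()))


def parse_versions(text: str) -> Dict[str, str]:
    """Parse 'name==version' lines, ignoring blank lines and '#' comments."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, ver = line.partition("==")
        if not sep:
            raise ValueError(f"not an exact pin: {line!r}")
        pins[name.strip().lower()] = ver.strip()
    return pins


def find_drift(
    recorded_text: str, installed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each drifted package to (recorded version, installed version or None)."""
    drift: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, want in parse_versions(recorded_text).items():
        have = installed.get(name)
        if have != want:
            drift[name] = (want, have)
    return drift
```

A later stage would fail the build when find_drift returns a non-empty mapping. The installed mapping could be gathered with importlib.metadata.version for each library name.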

@mergify mergify bot added the ci/build label Dec 11, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the Docker build process by removing the dedicated Dockerfile.nightly_torch and integrating its functionality into the main Dockerfile using a PYTORCH_NIGHTLY build argument. This is a good simplification. However, I've found a critical issue in the implementation of the nightly build logic within the main Dockerfile. The command for installing nightly PyTorch will likely fail due to version conflicts and an incorrect package index URL. I've provided a detailed comment with a suggested fix to address this.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@orionr orionr changed the title [PT nightlies] Remove nightly_torch Docker image and use standard [WIP][PT nightlies] Remove nightly_torch Docker image and use standard Dec 11, 2025
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

https://github.com/vllm-project/vllm/blob/0f56bd5e5179ae96789877a0cf16fc032e601ea8/docker/Dockerfile#L284-L287
P1 Badge Nightly builds still pin release torch after repo copy

When building with PYTORCH_NIGHTLY=1, the base stage installs nightly torch and strips torch entries from the copied requirement files, but this COPY . . reintroduces the original pyproject.toml/requirements that pin torch == 2.9.0. Because use_existing_torch.py is not run after this copy, the wheel built in this stage still declares the release torch dependency, so downstream uv pip install dist/*.whl will reinstall/downgrade to the stable torch and the nightly flag has no effect. Re-run the cleanup after the full repo copy or avoid copying the pinned files when building nightlies.

@orionr
Contributor Author

orionr commented Dec 11, 2025

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

https://github.com/vllm-project/vllm/blob/3418f6d59c06b6947d2a282707a8ad284b0e8d3f/docker/Dockerfile#L276-L278
P1 Badge Restore build.txt copy before wheel deps install

The wheel build stage now runs uv pip install ... -r requirements/build.txt, but this stage never copies that file anymore (base only brings in requirements/common.txt and requirements/cuda.txt, and no COPY precedes this RUN). As a result, docker build will hit a "requirements/build.txt: no such file or directory" failure when building the build stage, preventing image creation.

@orionr
Copy link
Copy Markdown
Contributor Author

orionr commented Dec 11, 2025

@codex review

@chatgpt-codex-connector

Codex Review: Something went wrong. Try again later by commenting “@codex review”.

We were unable to download your code in a timely manner.

@mergify
Contributor

mergify bot commented Dec 15, 2025

Hi @orionr, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@mergify
Contributor

mergify bot commented Dec 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @orionr.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 16, 2025
@orionr orionr force-pushed the orionr/pt-nightlies branch from cc6a0ed to 1ce9ed5 Compare December 20, 2025 16:25
@mergify mergify bot removed the needs-rebase label Dec 20, 2025
@orionr orionr changed the title [WIP][PT nightlies] Remove nightly_torch Docker image and use standard [PT nightlies] Remove nightly_torch Docker image and use standard Dec 20, 2025
@mergify
Contributor

mergify bot commented Dec 20, 2025

Hi @orionr, the pre-commit checks have failed again. The steps to run are the same as in the mergify comment of Dec 15, 2025 above.

@orionr orionr force-pushed the orionr/pt-nightlies branch from 1ce9ed5 to 7e9de52 Compare January 6, 2026 17:18
@orionr orionr changed the title [PT nightlies] Remove nightly_torch Docker image and use standard [PT nightlies] Use main Dockerfile with flags for nightly torch tests Jan 6, 2026
@orionr
Contributor Author

orionr commented Jan 6, 2026

Code changes should be ready for review after the final Buildkite test runs. And now done.

Contributor

@atalman atalman left a comment

lgtm

@amrmahdi
Contributor

amrmahdi commented Jan 7, 2026

@orionr What is the main motivation for this change? I'm worried it will impact the cache invalidation negatively. For starters, can you add a dummy commit and see how long your rebuild takes and how it affects the cache hits?

@orionr
Contributor Author

orionr commented Jan 7, 2026

> @orionr What is the main motivation for this change? I'm worried it will impact the cache invalidation negatively. For starters, can you add a dummy commit and see how long your rebuild takes and how it affects the cache hits?

@amrmahdi if you look at https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch it has diverged heavily from the core Dockerfile and is no longer functional. It's challenging to keep them in sync and to carry over the benefits added to the core Dockerfile. Generally, I think having these be hardware-specific (Nvidia, AMD, CPU, etc.) makes sense, but splitting on library version less so. Also, this sets the foundation to simplify cross-repo releases between PyTorch and vLLM, where there is even more usage of torch.compile and other bits for performance.

I have the same concerns around the caching, which is why I tagged you here and on the other PR. Let me start a chat with you about concerns here as well as on the ci-infra PR.

Edit: Kicked off a rebuild check at https://buildkite.com/vllm/ci/builds/45958/steps/canvas?sid=019b998b-6054-43ef-bab5-59e97e07fd6c. The previous build times on https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-4390-4709-bc4b-6f75c375c41a were build image (2h 37m) and build image torch nightly (4h 24m) so let's see.

Edit 2: Rebuild times show build image (24m 15s) and build image torch nightly (did a C++/CUDA rebuild, still going at 30+ minutes). Let's focus on build image, since we can optimize build image torch nightly later. This is good, but not down to the ~12-16 minutes that I think you were seeing @amrmahdi with your work? Did we regress something here?

Edit 3: With the addition of just a txt file, we are at build image (11m 47s), which is much better.

@mergify
Contributor

mergify bot commented Jan 7, 2026

Hi @orionr, the pre-commit checks have failed again. The steps to run are the same as in the mergify comment of Dec 15, 2025 above.

@khluu
Collaborator

khluu commented Jan 14, 2026

cc @Harry-Chen

@orionr
Contributor Author

orionr commented Jan 15, 2026

Heads up that we'd like to land this after I do some final rebuild and cache testing, which should be done by tomorrow. @amrmahdi and I had a good chat today and I'll also add some notes here about directions for standardization and improvements in the future.

Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
@orionr
Contributor Author

orionr commented Jan 15, 2026

After rebasing, build image took 21m and build image torch nightly took 7h 35m, both as expected and the same as before the rebase.

Ultimately, the torch nightly build succeeded at https://buildkite.com/vllm/ci/builds/47099/steps/canvas?sid=019bbeab-8c6c-4773-9271-bf77f9e02737 and three downstream tests had failures, which we'll fix separately. Much better than the build failing.

Also, I confirmed that the rebuild image, with a simple text file addition, took ~8m at https://buildkite.com/vllm/ci/builds/47229/steps/canvas?sid=019bc28f-ef90-49e4-a9d4-75467706a434. FYI to @amrmahdi @atalman.

With all of that, we're good to go! After we land this PR, here are some thoughts on next steps:

  1. #32721 ([WIP][CI][build] Move torch deps into requirements/torch.txt and torchlib.txt): We should explore moving all torch dep requirements into requirements/torch.txt (torch==...) and requirements/torchlibs.txt (torchaudio==..., torchtext==...), so that cpu.txt, xpu.txt, test.in, etc. would all -r that file. If we do that, the use_existing_torch.py file would just need to clear or override that file to make sure we are referencing nightlies (update to torch_lib_versions.txt) or using the existing torch (make it a blank file). I might try and put up a separate PR around this with my ideas.
  2. There would still be some pip install flags adjusted for torch nightly vs release (--pre, --index-url vs --extra-index-url), but we could maybe get rid of the if [ "${PYTORCH_NIGHTLY}" = "1" ]; then checks in most cases, especially outside of the two base images. This would be good! I don't think we would need to use --constraint if the requirements/ files are updated correctly, but we should check.
  3. We should look at caching torch nightly builds the same way as standard builds, but be careful about cache location and collisions with the torch release builds.
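
For item 2, one way to keep the nightly-vs-release difference in a single place is a small helper that returns only the index-related pip flags. This is a sketch under assumptions: it reads the comment as "nightly uses --pre with --index-url, release uses --extra-index-url", and the function name, the default cu128 tag, and the exact index paths are illustrative rather than taken from the Dockerfile.

```python
from typing import List

# Assumption: PyTorch wheel index base; the CUDA tag below is illustrative.
PYTORCH_WHL = "https://download.pytorch.org/whl"


def torch_index_args(pytorch_nightly: str, cuda_tag: str = "cu128") -> List[str]:
    """Return pip index flags for a nightly ("1") or release (anything else) build."""
    if pytorch_nightly == "1":
        # Nightlies are pre-releases served from their own index.
        return ["--pre", "--index-url", f"{PYTORCH_WHL}/nightly/{cuda_tag}"]
    # Release builds keep the default index and add the PyTorch one.
    return ["--extra-index-url", f"{PYTORCH_WHL}/{cuda_tag}"]
```

A caller would splice the result into a single command, for example ["pip", "install", *torch_index_args("1"), "torch"], so the rest of the build logic would not need its own PYTORCH_NIGHTLY branch.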

Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
@Harry-Chen Harry-Chen mentioned this pull request Jan 16, 2026
5 tasks
@khluu khluu added the ready-run-all-tests label Jan 16, 2026
@khluu khluu enabled auto-merge (squash) January 21, 2026 10:05
@github-actions github-actions bot added the ready label Jan 21, 2026
@mergify
Contributor

mergify bot commented Jan 21, 2026

Hi @orionr, the pre-commit checks have failed again. The steps to run are the same as in the mergify comment of Dec 15, 2025 above.

@orionr
Contributor Author

orionr commented Jan 21, 2026

@khluu I just ran pre-commit run --all-files and everything passed locally with no file changes. Seems like the CI error (not seen locally) is

diff --git a/docs/assets/contributing/dockerfile-stages-dependency.png b/docs/assets/contributing/dockerfile-stages-dependency.png
index c8839eb..9ac394d 100644
Binary files a/docs/assets/contributing/dockerfile-stages-dependency.png and b/docs/assets/contributing/dockerfile-stages-dependency.png differ
Error: Process completed with exit code 1.

Running from a different location (MacBook Pro in this case) seems to work.

auto-merge was automatically disabled January 22, 2026 18:51

Head branch was pushed to by a user without write access

@mergify
Contributor

mergify bot commented Jan 22, 2026

Documentation preview: https://vllm--30443.org.readthedocs.build/en/30443/

@mergify mergify bot added the documentation label Jan 22, 2026
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com>
@orionr orionr force-pushed the orionr/pt-nightlies branch from 69fdad8 to f1851bb Compare January 22, 2026 19:07
@khluu khluu enabled auto-merge (squash) January 22, 2026 22:28
@simon-mo simon-mo disabled auto-merge January 23, 2026 18:22
@simon-mo simon-mo merged commit 68b0a6c into vllm-project:main Jan 23, 2026
145 of 147 checks passed
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
…h tests (vllm-project#30443)

Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
Signed-off-by: Orion Reblitz-Richardson <orionr@gmail.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: 陈建华 <1647430658@qq.com>