[torch nightlies] Use main Dockerfile with flags for nightly torch tests #244

Merged
khluu merged 1 commit into main from orionr/pt-nightlies
Jan 23, 2026

@orionr
Collaborator

@orionr orionr commented Dec 10, 2025

Use standard Docker image instead of torch_nightly image for PyTorch nightlies testing and CI runs.

Moving this from #239 to a branch on upstream for testing purposes outlined at https://github.com/vllm-project/ci-infra?tab=readme-ov-file#how-to-test-changes-in-this-repo

Tests to confirm:

  1. Baseline (my vllm fork matching HEAD, no ci-infra changes) at https://buildkite.com/vllm/ci/builds/42874/steps/canvas. I allowed 5 test runs to move forward. Result: the PT nightlies build itself appears to have failed while installing flashinfer, so every test after it failed.
  2. New (my vllm changes at vllm-project/vllm#30443, my ci-infra changes in this PR, #244) with a successful build at https://buildkite.com/vllm/ci/builds/45736/steps/canvas?sid=019b9459-43ce-46d3-99c2-c10a1a8ce96c. One downstream test is failing, but that looks like a real failure that we will investigate.

We will remove https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.nightly_torch in a separate commit.

Looking for review and landing with help from @khluu, @amrmahdi, @atalman, @huydhn. Thanks!

@orionr orionr changed the title from "[PT nightlies] Remove nightly_torch Docker image and build" to "[WIP][PT nightlies] Remove nightly_torch Docker image and build" Dec 10, 2025
@orionr
Collaborator Author

orionr commented Dec 10, 2025

@khluu I might need your help on this one and/or have you point me to an expert on Buildkite configs.

I'm trying to use the standard Docker builds here for PyTorch nightly testing, but need to also run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or create a Docker image layer) before each test runs. I thought I'd figured this out by adding an extra commands section, but looks like that might need to get propagated down to and through render_cuda_config. Is that the right way to do this or should I go a different path?

Current status is that the main Docker image is used (which is good), but tests are all running on release PyTorch versions (not good) without the latest changes.

Latest failing run is at https://buildkite.com/vllm/ci/builds/42927/steps/canvas?sid=019b0a30-bee1-4b6b-8393-7f85b537d2ef with the error


[2025-12-10T22:21:19Z] public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:2dcbac9077ecadff0aa78b7c282f9e147a260e86
--
Error: Can't use both a step level command and the command parameter of the plugin

because of e596c0d#diff-b5c060fa4acd68fd48a2b3cdcd4069bd9eae5b0ee8512e1b25d8f8e2526834e5R480

Any thoughts? cc @atalman as well and I'll keep digging.

@huydhn
Collaborator

huydhn commented Dec 10, 2025

> @khluu I might need your help on this one and/or have you point me to an expert on Buildkite configs.
>
> I'm trying to use the standard Docker builds here for PyTorch nightly testing, but need to also run uv pip install torch torchvision torchaudio --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu128 (or create a Docker image layer) before each test runs. I thought I'd figured this out by adding an extra commands section, but looks like that might need to get propagated down to and through render_cuda_config. Is that the right way to do this or should I go a different path?
>
> Current status is that the main Docker image is used (which is good), but tests are all running on release PyTorch versions (not good) without the latest changes.
>
> Latest failing run is at https://buildkite.com/vllm/ci/builds/42927/steps/canvas?sid=019b0a30-bee1-4b6b-8393-7f85b537d2ef with the error
>
> [2025-12-10T22:21:19Z] public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:2dcbac9077ecadff0aa78b7c282f9e147a260e86
> --
> Error: Can't use both a step level command and the command parameter of the plugin
>
> because of e596c0d#diff-b5c060fa4acd68fd48a2b3cdcd4069bd9eae5b0ee8512e1b25d8f8e2526834e5R480
>
> Any thoughts? cc @atalman as well and I'll keep digging.

I think the uv pip install torch --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu12x step can only be done as a Docker layer inside https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile#L139-L143. Something like:

if [ "$NIGHTLY" = "1" ]; then
    # Nightly: install pre-release torch, then build vLLM against the torch already installed
    uv pip install torch --pre --extra-index-url ${PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
    python use_existing_torch.py
else
    # Release: install the pinned CUDA requirements
    uv pip install -r requirements/cuda.txt --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
fi
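
For reference, the cu$(...) expression in the snippet above turns a CUDA version such as 12.8.1 into the index suffix cu128. Here is a minimal standalone sketch of just that step. The function name cuda_index_suffix and the example values are assumptions for illustration; in the real Dockerfile the values come from build args.

```shell
#!/bin/sh
# Derive the PyTorch wheel index suffix (e.g. cu128) from a CUDA version string.
# cuda_index_suffix is a hypothetical helper name, not something in the vLLM repo.
cuda_index_suffix() {
    # Keep major.minor, drop any patch level, then strip the dot.
    printf 'cu%s\n' "$(printf '%s' "$1" | cut -d. -f1,2 | tr -d '.')"
}

# Assumed example values; the real ones are supplied as Dockerfile build args.
CUDA_VERSION="12.8.1"
PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL="https://download.pytorch.org/whl/nightly"

# Print the full extra index URL that uv pip install would be pointed at.
echo "${PYTORCH_CUDA_NIGHTLY_INDEX_BASE_URL}/$(cuda_index_suffix "$CUDA_VERSION")"
```

With these example values the resulting URL matches the nightly cu128 index quoted earlier in this thread.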

@orionr
Collaborator Author

orionr commented Dec 10, 2025

Good call on needing build as well as test signal. Let me see what I can do to modify the base Dockerfile.

@orionr orionr force-pushed the orionr/pt-nightlies branch from 5424fa5 to 55368b3 December 20, 2025 16:27
@orionr orionr changed the title from "[WIP][PT nightlies] Remove nightly_torch Docker image and build" to "[PT nightlies] Remove nightly_torch Docker image and build" Dec 20, 2025
@orionr orionr force-pushed the orionr/pt-nightlies branch from 55368b3 to 30eb1d6 January 6, 2026 17:20
@orionr orionr changed the title from "[PT nightlies] Remove nightly_torch Docker image and build" to "[PT nightlies] Use main Dockerfile with flags for nightly torch tests" Jan 6, 2026
@orionr
Collaborator Author

orionr commented Jan 6, 2026

Code changes should be ready for review after the final Buildkite test runs. And now done.

timeout_in_minutes: 600
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 936637512419.dkr.ecr.us-east-1.amazonaws.com"
Collaborator Author

Note that this matches standard Dockerfile build flags both here and down below. The key is that vLLM builds with PT nightlies and standard vLLM builds should be identical here minus the --build-arg PYTORCH_NIGHTLY=1 flag. Unfortunately we can't unify further yet, but we can do that in some additional commits.
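
As a rough sketch of the point above (not the actual ci-infra template), the two builds can share one argument list and differ only in the nightly flag. Only --build-arg PYTORCH_NIGHTLY=1 comes from this PR; the build_args helper, the test target, and the image tags are hypothetical placeholders.

```shell
#!/bin/sh
# Print the docker build arguments for a standard or a nightly-torch image.
# Only PYTORCH_NIGHTLY=1 is taken from the PR discussion; the target and
# tag values used here are hypothetical placeholders.
build_args() {
    nightly="$1"   # "1" for a PyTorch nightly build, anything else for standard
    tag="$2"       # image tag to produce
    args="--file docker/Dockerfile --target test --tag ${tag}"
    if [ "$nightly" = "1" ]; then
        # The single difference between the two builds.
        args="${args} --build-arg PYTORCH_NIGHTLY=1"
    fi
    printf '%s\n' "$args"
}

# Show both argument lists; nothing is actually built here.
build_args 0 example/vllm-ci:standard
build_args 1 example/vllm-ci:nightly
```

Keeping the shared flags in one place makes any drift between the standard and nightly builds easy to spot in review.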

Collaborator Author

For the main Nvidia GPU build this has now moved to buildkite/scripts/ci-bake.sh, but we should still keep these incremental changes towards what the main build was doing.

@orionr orionr changed the title from "[PT nightlies] Use main Dockerfile with flags for nightly torch tests" to "[torch nightlies] Use main Dockerfile with flags for nightly torch tests" Jan 8, 2026
@orionr orionr force-pushed the orionr/pt-nightlies branch from 5cbda75 to 0edf32b January 9, 2026 17:17
Signed-off-by: Orion Reblitz-Richardson <orionr@meta.com>
@orionr
Collaborator Author

orionr commented Jan 23, 2026

@simon-mo and @khluu thank you for merging vllm-project/vllm#30443! This is the other part of it. After this one we can land vllm-project/vllm#32426

@khluu khluu merged commit cb6f98d into main Jan 23, 2026
1 check passed

5 participants