[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline#38263

Merged
tjtanaa merged 9 commits into vllm-project:main from EmbeddedLLM:fix-nightly-rocm
Mar 26, 2026

Conversation

Collaborator

@tjtanaa tjtanaa commented Mar 26, 2026

Purpose

Fix the failure in https://buildkite.com/vllm/release-v2/builds/92/steps/canvas?sid=019d2b00-5a75-412f-a8f2-9607ef7e0ddc&tab=output (/bin/bash: line 57: ECR_IMAGE_TAG: unbound variable), which was introduced by #37283.
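For readers unfamiliar with this class of failure, the error quoted above is what bash prints when a script running under `set -u` expands a variable that was never set. The sketch below reproduces that behaviour in a child shell and shows one way to fail with a clearer message. It is an illustration only: `require_var` is a hypothetical helper, not something taken from the release pipeline, and the actual fix in this PR may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# With nounset (-u) active, expanding an unset variable aborts the shell with
# "<name>: unbound variable". Run the expansion in a child shell so this
# script keeps going after the child fails.
if ! env -u ECR_IMAGE_TAG bash -u -c 'echo "${ECR_IMAGE_TAG}"' 2>/dev/null; then
  echo "child shell aborted: ECR_IMAGE_TAG was unset"
fi

# Hypothetical guard (not from the PR): print the value of the named variable,
# or return non-zero with an explicit message if it is unset or empty.
require_var() {
  local name="$1"
  if [[ -z "${!name:-}" ]]; then
    echo "error: ${name} is not set" >&2
    return 1
  fi
  printf '%s\n' "${!name}"
}
```

A step could then call `tag="$(require_var ECR_IMAGE_TAG)"` before using the value. In a Buildkite pipeline file, writing `$$` instead of `$` defers expansion from pipeline upload time to the moment the command runs on the agent, which is why the snippets in this PR use forms such as `$${ECR_CACHE_TAG}`.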

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

tjtanaa added 3 commits March 26, 2026 16:52
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the Buildkite release pipeline by commenting out CUDA and CPU build steps and redirecting ROCm artifacts to a development S3 bucket. The ROCm pipeline is further modified with updated caching logic and variable escaping; however, the reviewer identified that caching is currently broken due to the removal of the push flag and a hardcoded variable override that forces a cache miss.

Comment thread .buildkite/release-pipeline.yaml
Comment thread .buildkite/release-pipeline.yaml Outdated
@mergify mergify bot added the ci/build, rocm (Related to AMD ROCm), and bug (Something isn't working) labels Mar 26, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 26, 2026
tjtanaa added 5 commits March 26, 2026 17:47
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
buildkite-agent meta-data set "rocm-base-image-tag" "$${ECR_CACHE_TAG}"

# Scenario 3: Full rebuild needed
# Scenario 2: Full rebuild needed
Collaborator Author

@tjtanaa tjtanaa Mar 26, 2026

This case (scenario 2) is not useful because we do not push all of the intermediate stages to ECR:

  • base
  • build_triton
  • build_pytorch
  • build_fa
  • build_aiter
  • build_amdsmi
  • build_mori
  • debs
  • debs_wheel_release

As a result, we cannot reuse the cache from the base image through --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base

Example command:

DOCKER_BUILDKIT=1 docker buildx build \
  --file docker/Dockerfile.rocm_base \
  --tag rocm-base-debs:$${BUILDKITE_BUILD_NUMBER} \
  --target debs_wheel_release \
  --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base \
  --build-arg USE_SCCACHE=1 \
  --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
  --build-arg SCCACHE_REGION_NAME=us-west-2 \
  --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
  --load \
  .
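For comparison, Docker Buildx can also export a dedicated cache manifest that covers intermediate stages by passing `--cache-to type=registry,...,mode=max`. The sketch below is not what this PR does and is only an illustration of that alternative: the cache reference `registry.example.com/rocm-base-cache:buildcache` is a made-up placeholder, and whether a given registry and builder driver accept this cache format is an assumption that would need checking. The script only assembles and prints the command, it does not run a build.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical cache location; a placeholder, not the project's real repository.
CACHE_REF="${CACHE_REF:-registry.example.com/rocm-base-cache:buildcache}"

# Assemble the buildx invocation as an array and print it (dry run).
# mode=max asks BuildKit to export cache for all stages, not only the
# layers of the final target. Registry cache export may require a
# non-default builder driver such as docker-container.
build_cmd() {
  local cmd=(
    docker buildx build
    --file docker/Dockerfile.rocm_base
    --target debs_wheel_release
    --cache-from "type=registry,ref=${CACHE_REF}"
    --cache-to "type=registry,ref=${CACHE_REF},mode=max"
    --load
    .
  )
  printf '%s\n' "${cmd[*]}"
}

build_cmd
```

Printing the command first makes it easy to review the flags in a CI log before replacing the final line with a real invocation.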

@tjtanaa tjtanaa marked this pull request as ready for review March 26, 2026 18:02

@claude claude bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@tjtanaa tjtanaa changed the title from "[ROCm] [Bugfix] [Release] Fix nightly rocm" to "[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline" Mar 26, 2026
Collaborator Author

tjtanaa commented Mar 26, 2026

@claude review

Collaborator Author

tjtanaa commented Mar 26, 2026

@gemini review

@tjtanaa tjtanaa added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 26, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request removes the partial cache handling logic in the ROCm release pipeline and corrects a variable interpolation error for the build number. The review feedback points out that removing the partial cache logic may cause performance regressions and suggests a specific fix to resolve the underlying issue while retaining the cache optimization.

Comment thread .buildkite/release-pipeline.yaml
@tjtanaa tjtanaa enabled auto-merge (squash) March 26, 2026 18:20
@tjtanaa tjtanaa merged commit bc9c6fb into vllm-project:main Mar 26, 2026
13 of 14 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 26, 2026
asrvastava pushed a commit to asrvastava/vllm that referenced this pull request Mar 26, 2026
…ject#38263)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: iamvastava <iamvastava@gmail.com>
RhizoNymph pushed a commit to RhizoNymph/vllm that referenced this pull request Mar 26, 2026
nithinvc pushed a commit to nithinvc/vllm that referenced this pull request Mar 27, 2026
…ject#38263)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Signed-off-by: Nithin Chalapathi <nithin.ch10@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…ject#38263)

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
Labels

bug (Something isn't working), ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done

2 participants