[ROCm] [Bugfix] [Release] Fix nightly rocm release pipeline#38263
tjtanaa merged 9 commits into vllm-project:main
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Code Review
This pull request updates the Buildkite release pipeline by commenting out CUDA and CPU build steps and redirecting ROCm artifacts to a development S3 bucket. The ROCm pipeline is further modified with updated caching logic and variable escaping; however, the reviewer identified that caching is currently broken due to the removal of the push flag and a hardcoded variable override that forces a cache miss.
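The two failure modes named in the review can be seen in a toy version of the cache check. The sketch below is a minimal illustration, not the pipeline's actual script: `image_exists`, `choose_scenario`, and `FAKE_EXISTING_IMAGE` are hypothetical names, and the registry probe is stubbed so the block runs without Docker or registry access.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical probe: succeeds only when the tagged image "exists".
# A real pipeline might run something like `docker manifest inspect "$1"`;
# here the registry is faked with an environment variable.
image_exists() {
  [ "$1" = "${FAKE_EXISTING_IMAGE:-}" ]
}

# Decide between reusing a published base image and rebuilding it.
choose_scenario() {
  local image_ref="$1"
  if image_exists "$image_ref"; then
    echo "cache-hit"      # Scenario 1: reuse the published base image
  else
    echo "full-rebuild"   # Scenario 2: build, then push so later runs can hit
  fi
}

# Pretend exactly one image has been pushed to the registry.
FAKE_EXISTING_IMAGE="repo:abc123-rocm-base"

choose_scenario "repo:abc123-rocm-base"   # tag matches a pushed image
choose_scenario "repo:hardcoded-tag"      # overridden tag that was never pushed
```

If the tag is overridden with a value that is never published, or the build result is never pushed, `image_exists` cannot succeed and every run falls through to the full rebuild.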
  buildkite-agent meta-data set "rocm-base-image-tag" "$${ECR_CACHE_TAG}"
- # Scenario 3: Full rebuild needed
+ # Scenario 2: Full rebuild needed
This case (scenario 2) is not useful because we do not push the intermediate stages to ECR:
- base
- build_triton
- build_pytorch
- build_fa
- build_aiter
- build_amdsmi
- build_mori
- debs
- debs_wheel_release
So we cannot reuse the cache from the base image through --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base
Example command:
DOCKER_BUILDKIT=1 docker buildx build \
    --file docker/Dockerfile.rocm_base \
    --tag rocm-base-debs:$${BUILDKITE_BUILD_NUMBER} \
    --target debs_wheel_release \
    --cache-from type=registry,ref=public.ecr.aws/q9t5s3a7/vllm-release-repo:1227c9527d573e09-rocm-base \
    --build-arg USE_SCCACHE=1 \
    --build-arg SCCACHE_BUCKET_NAME=vllm-build-sccache \
    --build-arg SCCACHE_REGION_NAME=us-west-2 \
    --build-arg SCCACHE_S3_NO_CREDENTIALS=0 \
    --load \
    .
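One possible way to make intermediate stages reusable, offered as a sketch rather than the fix adopted in this PR, is to export a BuildKit cache with `--cache-to type=registry,...,mode=max`, which records layers from all stages instead of only the final one. The block below assembles such a command and prints it as a dry run. The registry reference `registry.example.com/vllm/rocm-base:buildcache` and the helper `print_build_args` are hypothetical, and whether the target registry accepts cache manifests and whether the builder driver supports registry cache export are assumptions to verify.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the docker arguments one per line (dry run).
# Nothing is built or pushed by this function.
print_build_args() {
  local cache_ref="$1" target="$2"
  local -a args=(
    buildx build
    --file docker/Dockerfile.rocm_base
    --target "$target"
    # Import any cache that a previous run exported to this reference.
    --cache-from "type=registry,ref=${cache_ref}"
    # mode=max exports layers for every stage, not only the final image.
    --cache-to "type=registry,ref=${cache_ref},mode=max"
    .
  )
  printf '%s\n' "${args[@]}"
}

# The cache reference below is a placeholder, not a real repository.
print_build_args "registry.example.com/vllm/rocm-base:buildcache" debs_wheel_release
```

Keeping the cache under its own reference, separate from the image tag, means a later build of any stage in `Dockerfile.rocm_base` could import it without the intermediate images being pushed individually.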
@claude review
@gemini review
Code Review
This pull request removes the partial cache handling logic in the ROCm release pipeline and corrects a variable interpolation error for the build number. The review feedback points out that removing the partial cache logic may cause performance regressions and suggests a specific fix to resolve the underlying issue while retaining the cache optimization.
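For context on the escaping: Buildkite interpolates `${VAR}` references when the pipeline is uploaded, and `$$` escapes the dollar sign so the reference survives until the command runs on the agent. The block below is a toy emulation of those two rules for a single variable. `emulate_upload` is a hypothetical helper and not Buildkite's implementation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy emulation of upload-time interpolation for ONE variable.
#   $1: pipeline text, $2: variable name, $3: its value at upload time
emulate_upload() {
  local text="$1" name="$2" value="$3"
  local dollar='$' marker=$'\x01'
  local pattern="\${${name}}"          # the literal text ${NAME}

  text=${text//'$$'/$marker}           # 1. protect escaped dollars
  text=${text//"$pattern"/$value}      # 2. substitute unescaped ${NAME}
  text=${text//$marker/$dollar}        # 3. an escaped dollar becomes a literal $
  printf '%s\n' "$text"
}

# ECR_CACHE_TAG is computed inside the step, so it is empty at upload time.
emulate_upload 'meta-data set "tag" "$${ECR_CACHE_TAG}"' ECR_CACHE_TAG ""
emulate_upload 'meta-data set "tag" "${ECR_CACHE_TAG}"'  ECR_CACHE_TAG ""
```

In this emulation the escaped form reaches the shell as `${ECR_CACHE_TAG}` and is resolved at run time, while the unescaped form has already been replaced by an empty string before the script starts.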
Purpose
Fix https://buildkite.com/vllm/release-v2/builds/92/steps/canvas?sid=019d2b00-5a75-412f-a8f2-9607ef7e0ddc&tab=output
/bin/bash: line 57: ECR_IMAGE_TAG: unbound variable
introduced by #37283
Test Plan
Test Result