
[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script #36949

Open
AndreasKaratzas wants to merge 18 commits into vllm-project:main from ROCm:akaratza_optimize_docker_build

Conversation

AndreasKaratzas (Collaborator) commented Mar 13, 2026

  • .buildkite/scripts/ci-bake.sh: New AMD build script (pinned to the vLLM commit, so there is no curl-at-runtime supply-chain risk). Handles shallow-clone deepening and auto-computes PARENT_COMMIT and VLLM_MERGE_BASE_COMMIT for multi-layer BuildKit cache fallback. Uploads the bake config as a Buildkite artifact.
  • docker/docker-bake-rocm.hcl: Set the MAX_JOBS default to "" so the Dockerfile falls back to $(nproc) instead of a hardcoded 16.
  • docker/Dockerfile.rocm: Split the orphaned build_deep stage into separate build_rocshmem and build_deepep stages. Both are now wired into the test and final targets via --mount=type=bind so DeepEP is actually installed. Added ccache mounts to the ROCSHMEM/DeepEP compilation. Fixed make -j to make -j$(nproc).
  • .buildkite/hardware_tests/amd.yaml: Refactored to a commands: + env: block, added timeout_in_minutes: 600, and added an exit-code-128 retry rule.
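The commit-selection logic in the first bullet can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual ci-bake.sh: the throwaway repo, branch names, and deepening depth are made up here, and only the variable names PARENT_COMMIT and VLLM_MERGE_BASE_COMMIT come from the PR description.

```shell
# Illustrative sketch of the two cache-key commits ci-bake.sh computes.
# A throwaway repo stands in for the real checkout; in CI the script
# would first deepen a shallow clone so that HEAD^ resolves, e.g.:
#   git rev-parse --is-shallow-repository && git fetch --deepen=50 origin
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b trunk .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m base
git checkout -q -b feature
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m work

# Cache layer 1: the parent of the commit being built.
PARENT_COMMIT=$(git rev-parse HEAD^)

# Cache layer 2: the merge base with the target branch, used as a
# fallback when an image for the parent commit was never built.
VLLM_MERGE_BASE_COMMIT=$(git merge-base HEAD trunk)

echo "PARENT_COMMIT=$PARENT_COMMIT"
echo "VLLM_MERGE_BASE_COMMIT=$VLLM_MERGE_BASE_COMMIT"
```

In this toy history both values resolve to the same base commit; on a real PR branch they diverge once the target branch has moved on, which is exactly why having both as --cache-from candidates helps.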

This PR is connected to: vllm-project/ci-infra#307
These two PRs should likely be merged simultaneously.

cc @kenroche @okakarpa @tjtanaa @gshtras @khluu

…line

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, reducing build times and improving reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.

Comment on lines +406 to 407
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor

critical

The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Collaborator Author

Done :)

Comment on lines +491 to +492
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
Contributor

critical

Similar to the test stage, the final stage now installs the deepep wheel but omits the rocshmem runtime libraries, which are likely a runtime dependency; this is likely to cause runtime failures. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.

RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem

Collaborator Author

Done :)

Comment on lines +85 to +96
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
    echo "Using existing baked-vllm-builder"
    docker buildx use baked-vllm-builder
else
    echo "Creating baked-vllm-builder with remote driver"
    docker buildx create \
        --name baked-vllm-builder \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
Contributor

high

There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.

Suggested change
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
    echo "Using existing baked-vllm-builder"
    docker buildx use baked-vllm-builder
else
    echo "Creating baked-vllm-builder with remote driver"
    docker buildx create \
        --name baked-vllm-builder \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
# Check if ${BUILDER_NAME} already exists and is using the socket
if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
    echo "Using existing builder: ${BUILDER_NAME}"
    docker buildx use "${BUILDER_NAME}"
else
    echo "Creating builder '${BUILDER_NAME}' with remote driver"
    docker buildx create \
        --name "${BUILDER_NAME}" \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
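The core of the suggested change is the standard shell parameter-default pattern; a minimal sketch, assuming only that the default name vllm-builder comes from the review comment (the unset is there purely to make the demo deterministic):

```shell
# Honour a caller-supplied BUILDER_NAME, else fall back to the documented
# default; all later buildx calls should then use "${BUILDER_NAME}" only.
unset BUILDER_NAME                           # simulate no caller override
BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"
echo "builder: ${BUILDER_NAME}"              # prints: builder: vllm-builder
```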

Collaborator Author

Done :)

apt-transport-https ca-certificates wget curl
apt-transport-https ca-certificates wget curl \
ccache mold \
&& update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100
Collaborator

Changing the system linker hardly falls under "Install some basic utilities".
Could you at least provide the motivation for this in the PR description?

Collaborator Author

Correct, my bad; I updated the comment there as well.

RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    cd vllm \
    && uv pip install --system -r requirements/rocm-build.txt \
Collaborator

Why is rocm-build.txt being used in the docker build?

Collaborator Author

That's an oversight on my part; I thought it was just rocm.txt. I updated that as well.

COPY requirements/rocm-build.txt requirements/rocm-build.txt
COPY pyproject.toml setup.py CMakeLists.txt ./
COPY cmake cmake/
COPY csrc csrc/
Collaborator

Are you copying host files here? The point of REMOTE_VLLM is exactly not to do this.

Collaborator Author

I refactored based on an offline conversation to bring back the old way of doing this and avoid any trouble. I also integrated another recommended point: a per-arch build, so that we depend on an arch-specific Docker image rather than an all-arch one. Hope it looks better now :)

@AndreasKaratzas AndreasKaratzas marked this pull request as draft March 13, 2026 18:27
…e condition in highly concurrent max job settings

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 18, 2026 05:38

mergify bot commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 19, 2026
AndreasKaratzas (Collaborator Author)

@mawong-amd Let's check if Kernels Core Operation Test passes as well. We may need to bring back the compilation of triton_kernels. Not sure yet.
