[ROCm][CI] Optimize ROCm Docker build: registry cache, DeepEP, and ci-bake script #36949
AndreasKaratzas wants to merge 18 commits into vllm-project:main
Conversation
…line Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request introduces significant optimizations to the ROCm Docker build process by leveraging docker bake, multi-stage builds, and caching mechanisms like ccache. The new ci-bake.sh script centralizes and improves the CI build logic, enhancing build times and reliability. The changes are well-structured and thoughtful. I've identified a couple of critical issues related to missing runtime dependencies in the Dockerfile and a high-severity issue regarding configuration consistency in the new bake script.
```dockerfile
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
```
The rocshmem library appears to be a runtime dependency for deepep. This test stage installs the deepep wheel but no longer copies the rocshmem installation from the build stage. This could lead to runtime errors if the deepep wheel does not bundle the rocshmem shared libraries. Please restore the copy of the rocshmem directory from the build_rocshmem stage to ensure deepep can function correctly.
```dockerfile
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
```
```dockerfile
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl
```
Similar to the test stage, the final stage now installs the deepep wheel but is missing the rocshmem runtime libraries which are likely a runtime dependency. This is likely to cause runtime failures. Please add a COPY instruction to include the rocshmem installation from the build_rocshmem stage.
```dockerfile
RUN --mount=type=bind,from=build_deepep,src=/app/deep_install,target=/deep_install \
    uv pip install --system /deep_install/*.whl

# Copy rocshmem runtime libraries
COPY --from=build_rocshmem /opt/rocshmem /opt/rocshmem
```
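One way to sanity-check the reviewer's concern is to inspect whether the deepep wheel actually bundles the rocshmem shared libraries: a wheel is a plain zip archive, so listing its `.so` members answers the question. A minimal sketch, with a synthetic wheel standing in for the real `/deep_install/*.whl` artifact (filenames here are illustrative, not the real wheel's contents):

```python
import zipfile

def shared_libs_in_wheel(wheel_path: str) -> list[str]:
    """Return the .so members of a wheel (wheels are plain zip archives)."""
    with zipfile.ZipFile(wheel_path) as wf:
        return [n for n in wf.namelist() if n.endswith(".so")]

if __name__ == "__main__":
    # Build a synthetic wheel for illustration; in CI you would point this
    # at the real deepep wheel from /deep_install/*.whl.
    demo = "deepep-demo.whl"  # hypothetical filename
    with zipfile.ZipFile(demo, "w") as wf:
        wf.writestr("deepep/_core.so", b"")               # extension module
        wf.writestr("deepep.libs/librocshmem.so", b"")    # bundled dependency?
    print(shared_libs_in_wheel(demo))
    # → ['deepep/_core.so', 'deepep.libs/librocshmem.so']
```

If the real wheel's listing shows no `librocshmem*.so`, the `COPY --from=build_rocshmem` instruction (plus the matching library path configuration) would indeed be required at runtime.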
.buildkite/scripts/ci-bake.sh (outdated)
```shell
# Check if baked-vllm-builder already exists and is using the socket
if docker buildx inspect baked-vllm-builder >/dev/null 2>&1; then
    echo "Using existing baked-vllm-builder"
    docker buildx use baked-vllm-builder
else
    echo "Creating baked-vllm-builder with remote driver"
    docker buildx create \
        --name baked-vllm-builder \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
```
There's an inconsistency in the buildx builder naming. The script accepts a BUILDER_NAME environment variable (defaulting to vllm-builder), but when a local buildkitd socket is detected, it hardcodes the builder name to baked-vllm-builder. This could lead to confusion and incorrect builder usage if BUILDER_NAME is customized. For consistency, please use the ${BUILDER_NAME} variable throughout the script.
Suggested change:

```shell
# Check if ${BUILDER_NAME} already exists and is using the socket
if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
    echo "Using existing builder: ${BUILDER_NAME}"
    docker buildx use "${BUILDER_NAME}"
else
    echo "Creating builder '${BUILDER_NAME}' with remote driver"
    docker buildx create \
        --name "${BUILDER_NAME}" \
        --driver remote \
        --use \
        "unix://${BUILDKIT_SOCKET}"
fi
```
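The fix boils down to standard bash default-expansion: derive every builder reference from a single variable so a caller-supplied `BUILDER_NAME` can never diverge from the builder actually created and used. A minimal sketch (the default socket path is a placeholder, not necessarily what ci-bake.sh uses):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Single source of truth: honor the caller's BUILDER_NAME, else default.
BUILDER_NAME="${BUILDER_NAME:-vllm-builder}"
BUILDKIT_SOCKET="${BUILDKIT_SOCKET:-/run/buildkit/buildkitd.sock}"  # placeholder path

# Every later command interpolates the same variable, so customizing
# BUILDER_NAME in the environment affects inspect, use, and create alike.
echo "inspect target: ${BUILDER_NAME}"
echo "create args: --name ${BUILDER_NAME} --driver remote unix://${BUILDKIT_SOCKET}"
```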
docker/Dockerfile.rocm (outdated)
```diff
-    apt-transport-https ca-certificates wget curl
+    apt-transport-https ca-certificates wget curl \
+    ccache mold \
+    && update-alternatives --install /usr/bin/ld ld /usr/bin/mold 100
```
Changing the system linker hardly falls under "Install some basic utilities".
Could you at least provide the motivation for this in the PR description?
Correct, my bad; I updated the comment there as well.
docker/Dockerfile.rocm (outdated)
```dockerfile
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    cd vllm \
    && uv pip install --system -r requirements/rocm-build.txt \
```
Why is rocm-build.txt being used in the docker build?
That's an oversight on my part; I thought it was just rocm.txt. I updated that as well.
docker/Dockerfile.rocm (outdated)
```dockerfile
COPY requirements/rocm-build.txt requirements/rocm-build.txt
COPY pyproject.toml setup.py CMakeLists.txt ./
COPY cmake cmake/
COPY csrc csrc/
```
Are you copying host files here? The point of REMOTE_VLLM is exactly not to do this.
I refactored based on an offline conversation to bring back the old way of doing this and avoid any trouble. I also integrated another recommended point, a per-arch build, so that we use an arch-specific Docker dependency rather than an all-arch one. Hope it looks better now :)
First successful build with vllm-project/ci-infra#307: https://buildkite.com/vllm/amd-ci/builds/6536/steps/canvas
…e condition in highly concurrent max job settings Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This pull request has merge conflicts that must be resolved before it can be merged.
@mawong-amd Let's check if
- .buildkite/scripts/ci-bake.sh: New AMD build script (pinned to vllm commit, no curl-at-runtime supply chain risk). Handles shallow clone deepening, auto-computes PARENT_COMMIT and VLLM_MERGE_BASE_COMMIT for multi-layer BuildKit cache fallback. Uploads bake config as a Buildkite artifact.
- docker/docker-bake-rocm.hcl: Set MAX_JOBS default to "" so the Dockerfile falls back to $(nproc) instead of a hardcoded 16.
- docker/Dockerfile.rocm: Split the orphaned build_deep stage into separate build_rocshmem and build_deepep stages. Both are now wired into the test and final targets via --mount=type=bind so DeepEP is actually installed. Added ccache mounts to ROCSHMEM/DeepEP compilation. Fixed make -j -> make -j$(nproc).
- .buildkite/hardware_tests/amd.yaml: Refactored to commands: + env: block, added timeout_in_minutes: 600, added an exit code 128 retry rule.

This PR is connected to: vllm-project/ci-infra#307
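The multi-layer cache fallback described above can be sketched as composing several registry cache sources in priority order, so a rebuild reuses the closest available layer cache. All names below (registry, tag scheme, commit values) are illustrative placeholders, not the actual values ci-bake.sh computes from git history:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative inputs; ci-bake.sh derives these from the git history.
REGISTRY="registry.example.com/vllm-ci"   # placeholder registry
COMMIT="abc1234"
PARENT_COMMIT="def5678"
VLLM_MERGE_BASE_COMMIT="123abcd"

# Try the exact commit first, then its parent, then the merge base with main.
cache_args=()
for ref in "${COMMIT}" "${PARENT_COMMIT}" "${VLLM_MERGE_BASE_COMMIT}"; do
  cache_args+=("--cache-from=type=registry,ref=${REGISTRY}:cache-${ref}")
done

echo "${cache_args[@]}"
```

These arguments would then be passed to `docker buildx bake` (or `build`), which consults each cache source in order.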
These two PRs should likely be merged simultaneously.
cc @kenroche @okakarpa @tjtanaa @gshtras @khluu