Merged
buildkite/test-template-ci.j2 (48 additions, 0 deletions)

@@ -507,6 +507,54 @@ steps:
- exit_status: -10 # Agent was lost
limit: 2

{% if branch == "main" %}
- label: ":docker: build image HPU"
key: image-build-hpu
depends_on: ~
agents:
queue: cpu_queue_postmerge_us_east_1
jakub-sochacki marked this conversation as resolved.
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- |
@khluu (Collaborator) commented on Oct 30, 2025:

Can we add this script to the vllm repo and call it here, instead of sending the whole script as part of commands?


Collaborator:

Yes, a PR has been submitted to the vllm repo as well:
vllm-project/vllm#26919

Collaborator:

I don't see the script in vllm-project/vllm#26919

Collaborator:

Oh, which part do you want us to move to vllm-project?
Because hpu is a plugin, we use the lines below to build our Docker image. Is this the part you suggested moving into vllm?

# Fetch the compatible vLLM commit for vllm-gaudi
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')

git clone https://github.com/vllm-project/vllm-gaudi.git /tmp/vllm-gaudi

docker build \
  --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
  --build-arg max_jobs=16 \
  --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
  --build-arg VLLM_GAUDI_COMMIT=main \
  --tag "$HPU_IMAGE_TAG" \
  --progress plain .

VLLM_STABLE_COMMIT is necessary because the latest vllm commit might break vllm-gaudi, so we track the last known-good vllm commit SHA.

The rest of the change uses the Dockerfile.hpu that exists in vllm-gaudi, instead of the one in vllm, to build the image.
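Since the pinned commit is fetched from a plain text file over HTTP, a defensive shape check could catch a broken fetch before it reaches the image tag. This guard is not part of the PR, just a hedged sketch; the placeholder value stands in for the curl result:

```shell
# Hypothetical guard, not in the PR: verify the fetched pin looks like a
# full 40-character lowercase git SHA before using it as a build arg and tag.
VLLM_STABLE_COMMIT="0123456789abcdef0123456789abcdef01234567"  # placeholder for the curl result
if [[ "$VLLM_STABLE_COMMIT" =~ ^[0-9a-f]{40}$ ]]; then
  echo "pin looks valid"   # prints: pin looks valid
else
  echo "bad pin: $VLLM_STABLE_COMMIT" >&2
  exit 1
fi
```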

Collaborator:

Oh, what I mean is: can we store the bash script from commands in a file in the vllm-project/vllm repo, then just call that script here, like https://github.com/vllm-project/ci-infra/blob/main/buildkite/test-template-ci.j2#L722
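That suggestion would reduce the Buildkite step to a single script invocation, roughly like the sketch below. The script path is hypothetical; the actual location would have to be agreed in the vllm PR:

```yaml
  - label: ":docker: build image HPU"
    key: image-build-hpu
    commands:
      # hypothetical path; the script itself would live in vllm-project/vllm
      - "bash .buildkite/scripts/hardware_ci/build-hpu-image.sh"
```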

Contributor Author:

@khluu I understand that scripts like https://github.com/vllm-project/vllm/blob/main/.buildkite/scripts/hardware_ci/run-xpu-test.sh are meant to be run in CI. Our goal is to build images in CI but run performance benchmarks on a nightly / 12-hour cadence. Do you suggest moving the Docker build to a separate script in vllm?

@xuechendi (Collaborator) commented on Oct 31, 2025:

@khluu, I discussed this with Jakub.
Can we merge this PR as it is now? The PyTorch-integration PR is merged, so we want to see how it goes, in case we need more fixes to clear the path.

=> The reason we have to do it this way for now: the PyTorch-integration PR also uses VLLM_STABLE_COMMIT as part of the image name for indexing, which is why this PR needs to use that commit id to tag the image.

In the next PR, we will remove the whole VLLM_STABLE_COMMIT mechanism and use BUILDKITE_COMMIT directly, and also submit another PyTorch-integration PR to use BUILDKITE_COMMIT to index the HPU Docker image there.

Then the HPU Docker build can be simplified to a single step: build the image from Dockerfile.hpu in vllm-gaudi.
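The planned follow-up could shrink the tagging logic to something like this hedged sketch, with no curl fetch at all. BUILDKITE_COMMIT is supplied by the Buildkite agent at runtime; the default here is only a placeholder:

```shell
# Hedged sketch of the planned simplification: tag the image directly with
# BUILDKITE_COMMIT instead of fetching VLLM_STABLE_COMMIT from vllm-gaudi.
REGISTRY="public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo"
BUILDKITE_COMMIT="${BUILDKITE_COMMIT:-0000000000000000000000000000000000000000}"  # set by Buildkite in CI
HPU_IMAGE_TAG="${REGISTRY}:${BUILDKITE_COMMIT}-hpu"
echo "$HPU_IMAGE_TAG"
```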

#!/bin/bash
# Fetch the compatible vLLM commit for vllm-gaudi
VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
echo "Compatible vLLM commit for vllm-gaudi: $VLLM_STABLE_COMMIT"

# HPU images always use postmerge registry (main branch only)
REGISTRY="public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo"

# HPU images use the stable commit tag, not BUILDKITE_COMMIT
HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu"

if [[ -z $(docker manifest inspect "$HPU_IMAGE_TAG") ]]; then
  echo "Image not found, proceeding with build..."
else
  echo "Image $HPU_IMAGE_TAG already exists"
  exit 0
fi

git clone https://github.com/vllm-project/vllm-gaudi.git /tmp/vllm-gaudi

docker build \
  --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
  --build-arg max_jobs=16 \
  --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
Collaborator:

@jakub-sochacki, is VLLM_COMMIT needed here? I realized you added the same step in the Dockerfile, right?

Contributor Author:

This ensures that the same vllm commit is used in the Dockerfile and in the Docker image tag HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu".
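The invariant in that reply can be sketched in isolation: because both the VLLM_COMMIT build arg and the image tag derive from the same variable, the image contents and its tag cannot drift apart. Values below are placeholders, not the real pin:

```shell
# Both the build arg and the image tag come from one VLLM_STABLE_COMMIT
# variable, so the vllm commit baked into the image matches the tag.
REGISTRY="public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo"
VLLM_STABLE_COMMIT="abc123"  # placeholder for the fetched pin
HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu"
VLLM_COMMIT_ARG="VLLM_COMMIT=${VLLM_STABLE_COMMIT}"
echo "$HPU_IMAGE_TAG"    # prints: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:abc123-hpu
echo "$VLLM_COMMIT_ARG"  # prints: VLLM_COMMIT=abc123
```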

  --build-arg VLLM_GAUDI_COMMIT=main \
  --tag "$HPU_IMAGE_TAG" \
  --progress plain .

docker push "$HPU_IMAGE_TAG"
env:
DOCKER_BUILDKIT: "1"
retry:
automatic:
- exit_status: -1 # Agent was lost
limit: 2
- exit_status: -10 # Agent was lost
limit: 2
{% endif %}

{% for step in steps %}
{% if step.fast_check_only != true %}
