[CI/Build][Intel] Add HPU image build with vllm-gaudi compatibility #191
Changes from all commits: 6c32ec8, f472a8d, 1d5e989, 8e5f332, bb802f2
```diff
@@ -507,6 +507,54 @@ steps:
       - exit_status: -10 # Agent was lost
         limit: 2
 
+{% if branch == "main" %}
+- label: ":docker: build image HPU"
+  key: image-build-hpu
+  depends_on: ~
+  agents:
+    queue: cpu_queue_postmerge_us_east_1
+  commands:
+    - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+    - |
```
Review thread on the inline `commands` script:

Collaborator: Can we add this script to the vllm repo and call it from here, instead of sending the whole script as part of `commands`?

Collaborator: Yes, a PR has been submitted to the vllm repo as well:

Collaborator: I don't see the script in vllm-project/vllm#26919.

Collaborator: Oh, which part do you want us to move to vllm-project? The `VLLM_STABLE_COMMIT` lookup is necessary because the latest vllm commit might fail against vllm-gaudi, so we track the last known-good vllm commit SHA. The rest of the script builds the Docker image from the `Dockerfile.hpu` that exists in vllm-gaudi instead of vllm.

Collaborator: Oh, what I mean is: can we store the bash script from the command into a file on

Contributor (Author): @khluu I understand that scripts like https://github.com/vllm-project/vllm/blob/main/.buildkite/scripts/hardware_ci/run-xpu-test.sh are meant to run in the CI. Our goal is to build images in the CI but run the performance benchmarks nightly / on a 12-hour cadence. Do you suggest moving the docker build to a separate script in vllm?

Collaborator: @khluu, I discussed this with Jakub. The reason we have to do it this way now is that the PyTorch-integration PR also uses `VLLM_STABLE_COMMIT` as part of the image name for indexing, which is why this PR needs the commit id to tag the image. In the next PR, the HPU docker build can be simplified to a single step: build the image from vllm-gaudi's `Dockerfile.hpu`.
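As a sketch of the refactor the reviewers discuss, the tag-assembly logic could live in a small checked-in helper instead of the inline `commands` block. Everything below is illustrative; the function names and the idea of a separate helper file are assumptions, not part of this PR:

```shell
#!/bin/bash
# Hypothetical helper sketch; function names are assumptions, not code from the PR.

# Strip newlines from the raw VLLM_STABLE_COMMIT file contents so the value
# is safe to embed in a Docker image tag.
sanitize_commit() {
    tr -d '\n' <<< "$1"
}

# Compose the image tag the same way the Buildkite step does:
# "<registry>:<stable-commit>-hpu".
hpu_image_tag() {
    local registry="$1" commit="$2"
    echo "${registry}:${commit}-hpu"
}

# In CI the commit would come from:
#   curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT
```

Factoring the logic out this way would also let the pipeline template shrink to a single `bash .buildkite/...` invocation, which is what the reviewer is asking for.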
```diff
+      #!/bin/bash
+      # Fetch the compatible vLLM commit for vllm-gaudi
+      VLLM_STABLE_COMMIT=$(curl -s https://raw.githubusercontent.com/vllm-project/vllm-gaudi/vllm/last-good-commit-for-vllm-gaudi/VLLM_STABLE_COMMIT | tr -d '\n')
+      echo "Compatible vLLM commit for vllm-gaudi: $VLLM_STABLE_COMMIT"
+
+      # HPU images always use postmerge registry (main branch only)
+      REGISTRY="public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo"
+
+      # HPU images use the stable commit tag, not BUILDKITE_COMMIT
+      HPU_IMAGE_TAG="${REGISTRY}:${VLLM_STABLE_COMMIT}-hpu"
+
+      if [[ -z $(docker manifest inspect "$HPU_IMAGE_TAG") ]]; then
+        echo "Image not found, proceeding with build..."
+      else
+        echo "Image $HPU_IMAGE_TAG already exists"
+        exit 0
+      fi
+
+      git clone https://github.com/vllm-project/vllm-gaudi.git /tmp/vllm-gaudi
+
+      docker build \
+        --file /tmp/vllm-gaudi/tests/pytorch_ci_hud_benchmark/Dockerfile.hpu \
+        --build-arg max_jobs=16 \
+        --build-arg VLLM_COMMIT=$VLLM_STABLE_COMMIT \
```
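The `docker manifest inspect` check in the script above is what makes the step idempotent: if the tag is already in the registry, the job exits 0 without rebuilding. A minimal sketch of that gate, with the registry probe passed in as a function so the decision logic can be exercised without contacting a real registry (the function names are illustrative):

```shell
#!/bin/bash
# Sketch of the existence gate from the step above; the probe is injectable
# so the decision logic can run without a registry.

should_build() {
    local tag="$1" probe="$2"   # probe: command that succeeds iff the tag exists
    if "$probe" "$tag"; then
        echo "Image $tag already exists"
        return 1    # the inline script does `exit 0` here, skipping the build
    fi
    echo "Image not found, proceeding with build..."
    return 0
}

# Real pipeline probe would look like (assumption, matching the inline script):
#   registry_probe() { [[ -n $(docker manifest inspect "$1" 2>/dev/null) ]]; }
```

Note that `docker manifest inspect` queries the registry directly, so no local pull is needed to decide whether to skip the build.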
|
Review thread on `--build-arg VLLM_COMMIT`:

Collaborator: @jakub-sochacki, is `VLLM_COMMIT` needed here? I realized you added the same step in the Dockerfile, right?

Contributor (Author): This ensures that the same vllm commit is used in the Dockerfile and in the docker image tag.
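The author's answer describes an invariant: a single `VLLM_STABLE_COMMIT` value feeds both the `--build-arg` and the tag, so an image's tag always names the exact vLLM commit baked into it. A minimal sketch of that coupling, assuming a hypothetical `build_cmd` function that is not part of the PR:

```shell
#!/bin/bash
# Illustrative sketch: one commit value drives both the build arg and the tag,
# so the two can never drift apart.

build_cmd() {
    local commit="$1" registry="$2"
    local tag="${registry}:${commit}-hpu"
    printf 'docker build --build-arg VLLM_COMMIT=%s --tag %s .' "$commit" "$tag"
}
```

If the tag and the build arg were assembled from separately fetched values, a race between two fetches could tag an image with a commit it does not actually contain; deriving both from one variable rules that out.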
```diff
+        --build-arg VLLM_GAUDI_COMMIT=main \
+        --tag "$HPU_IMAGE_TAG" \
+        --progress plain .
+
+      docker push "$HPU_IMAGE_TAG"
+  env:
+    DOCKER_BUILDKIT: "1"
+  retry:
+    automatic:
+      - exit_status: -1 # Agent was lost
+        limit: 2
+      - exit_status: -10 # Agent was lost
+        limit: 2
+{% endif %}
 
 {% for step in steps %}
 {% if step.fast_check_only != true %}
```