CI: retry docker pulls in workflow image downloads #2977
Retry image pulls in ATOM, vLLM, and flash attention workflows so transient registry failures do not fail CI immediately. Add a shared helper where the job checks out aiter and keep an inline retry for the ATOM job that checks out the ATOM repo.
Pull request overview
This PR improves CI resilience by adding retry behavior to Docker image pulls across multiple workflows, reducing failures from transient registry/network issues.
Changes:
- Added a shared `.github/scripts/docker_pull_with_retry.sh` helper to retry `docker pull` with configurable attempts/delay.
- Updated vLLM benchmark and Flash Attention integration workflows to use the shared helper.
- Kept an inline retry loop in the ATOM workflow (since it checks out a different repository), and added an `aiter` checkout to the vLLM benchmark job to access the shared helper.
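The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the contents of the PR's `docker_pull_with_retry.sh`: the function name and the `DOCKER_PULL_MAX_ATTEMPTS`, `DOCKER_PULL_RETRY_DELAY`, and `DOCKER_BIN` variables are assumed names chosen for the example, and the real script's knobs may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a docker pull retry helper (not the PR's actual script).
# Assumed knobs: DOCKER_PULL_MAX_ATTEMPTS, DOCKER_PULL_RETRY_DELAY, DOCKER_BIN.

docker_pull_with_retry() {
  if [ "$#" -lt 1 ]; then
    echo "usage: docker_pull_with_retry IMAGE" >&2
    return 2
  fi

  local image="$1"
  local max_attempts="${DOCKER_PULL_MAX_ATTEMPTS:-3}"  # total tries, not extra retries
  local delay="${DOCKER_PULL_RETRY_DELAY:-10}"         # seconds between tries
  local docker_bin="${DOCKER_BIN:-docker}"             # overridable for local testing
  local attempt=1

  while true; do
    echo "Pulling ${image} (attempt ${attempt}/${max_attempts})"
    if "${docker_bin}" pull "${image}"; then
      return 0
    fi
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "Failed to pull ${image} after ${max_attempts} attempts" >&2
      return 1
    fi
    echo "Pull failed; retrying in ${delay}s" >&2
    sleep "${delay}"
    attempt=$((attempt + 1))
  done
}
```

A workflow step could source a file like this and call `docker_pull_with_retry "$IMAGE"`. A fixed delay keeps the sketch short; a backoff schedule would be a reasonable variation if registry outages tend to last longer than a few seconds.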
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `.github/workflows/vllm_benchmark.yaml` | Adds retry env knobs, uses shared docker-pull retry helper, and checks out aiter in the benchmark job to access the helper. |
| `.github/workflows/flash_attention_integration.yaml` | Adds retry env knobs and replaces direct `docker pull` with the shared retry helper in relevant jobs. |
| `.github/workflows/atom-test.yaml` | Adds retry env knobs and replaces a single `docker pull` with an inline retry loop (since it checks out the ATOM repo). |
| `.github/scripts/docker_pull_with_retry.sh` | New shared helper script implementing retry logic for Docker image pulls. |
Sync the PR with main after the docker login retry changes landed so the branch keeps the newer vLLM wheel-artifact workflow while preserving docker pull retries.
    steps:
      - name: Checkout aiter repo
        uses: actions/checkout@v4
nit: we should be using `actions/checkout@v6`.
Also check whether we can check out only the script itself instead of the whole repo; something like `sparse-checkout: .github/scripts` should do.
Use actions/checkout@v6 for the vLLM benchmark helper checkout and sparse-checkout only .github/scripts so the job does not clone the full repository just to access the docker pull retry helper.
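For readers unfamiliar with sparse checkouts, the narrowing can be approximated outside of Actions with plain git. The sketch below is an illustration under assumptions, not what the workflow runs: `sparse_clone` is a hypothetical helper name, and it relies on `git clone --sparse` and `git sparse-checkout set`, which need git 2.25 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone a repository but materialize only selected directories.
# Usage: sparse_clone REPO_URL DEST DIR [DIR...]

sparse_clone() {
  if [ "$#" -lt 3 ]; then
    echo "usage: sparse_clone REPO_URL DEST DIR [DIR...]" >&2
    return 2
  fi

  local repo_url="$1"
  local dest="$2"
  shift 2

  # --sparse starts with only top-level files in the working tree.
  git clone --quiet --sparse "${repo_url}" "${dest}" || return 1

  # Restrict the working tree to the requested directories.
  git -C "${dest}" sparse-checkout set "$@"
}
```

For example, `sparse_clone "$REPO_URL" aiter .github/scripts` would leave the helper script directory in the working tree while skipping the rest of the source tree. Depending on the git version, files at the repository root may also be present. Adding `--filter=blob:none` to the clone can further reduce the download when the server supports partial clone.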
* CI: retry docker pulls in workflow image downloads
  Retry image pulls in ATOM, vLLM, and flash attention workflows so transient registry failures do not fail CI immediately. Add a shared helper where the job checks out aiter and keep an inline retry for the ATOM job that checks out the ATOM repo.
* CI: narrow vLLM helper checkout scope
  Use actions/checkout@v6 for the vLLM benchmark helper checkout and sparse-checkout only .github/scripts so the job does not clone the full repository just to access the docker pull retry helper.
Summary
- `docker_pull_with_retry.sh` helper for jobs that check out the aiter repo and keep an inline retry loop for the ATOM job that checks out the ATOM repo

Test plan
- `bash -n .github/scripts/docker_pull_with_retry.sh`
- `python3 -c "import sys, yaml; [yaml.safe_load(open(path, encoding='utf-8')) for path in sys.argv[1:]]; print('YAML OK')" .github/workflows/atom-test.yaml .github/workflows/vllm_benchmark.yaml .github/workflows/flash_attention_integration.yaml`
- `git diff --check`