CI: retry docker pulls in workflow image downloads#2977

Merged
gyohuangxin merged 4 commits into main from ci/docker-pull-retry-main-20260430
May 4, 2026

@gyohuangxin
Member

Summary

  • retry Docker image pulls in the ATOM, vLLM, and flash attention workflows so transient registry failures do not fail CI immediately
  • add a shared docker_pull_with_retry.sh helper for jobs that check out the aiter repo and keep an inline retry loop for the ATOM job that checks out the ATOM repo
  • check out the aiter repo in the vLLM benchmark job before reusing the shared Docker pull helper
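
A minimal sketch of what such a shared retry helper might look like, assuming a fixed delay between attempts. This is not the script from this PR: the function names `retry_cmd` and `docker_pull_with_retry` and the environment variables `DOCKER_PULL_MAX_ATTEMPTS` and `DOCKER_PULL_RETRY_DELAY` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; NOT the actual docker_pull_with_retry.sh from this PR.

# retry_cmd MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have failed.
retry_cmd() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${attempt} attempt(s): $*" >&2
      return 1
    fi
    echo "attempt ${attempt}/${max_attempts} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Wrapper for image pulls. The env var names and defaults are assumptions.
docker_pull_with_retry() {
  local image="$1"
  retry_cmd "${DOCKER_PULL_MAX_ATTEMPTS:-3}" "${DOCKER_PULL_RETRY_DELAY:-10}" \
    docker pull "$image"
}
```

A workflow step could source a file like this and call `docker_pull_with_retry "$IMAGE"`. Keeping the loop generic in `retry_cmd` would also let the same logic be pasted inline into a job that does not check out the aiter repo.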

Test plan

  • bash -n .github/scripts/docker_pull_with_retry.sh
  • python3 -c "import sys, yaml; [yaml.safe_load(open(path, encoding='utf-8')) for path in sys.argv[1:]]; print('YAML OK')" .github/workflows/atom-test.yaml .github/workflows/vllm_benchmark.yaml .github/workflows/flash_attention_integration.yaml
  • git diff --check

Retry image pulls in ATOM, vLLM, and flash attention workflows so transient registry failures do not fail CI immediately. Add a shared helper where the job checks out aiter and keep an inline retry for the ATOM job that checks out the ATOM repo.
@gyohuangxin requested review from a team and Copilot on April 30, 2026 10:01
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2977 --add-label <label>

Contributor

Copilot AI left a comment

Pull request overview

This PR improves CI resilience by adding retry behavior to Docker image pulls across multiple workflows, reducing failures from transient registry/network issues.

Changes:

  • Added a shared .github/scripts/docker_pull_with_retry.sh helper to retry docker pull with configurable attempts/delay.
  • Updated vLLM benchmark and Flash Attention integration workflows to use the shared helper.
  • Kept an inline retry loop in the ATOM workflow (since it checks out a different repository), and added an aiter checkout to the vLLM benchmark job to access the shared helper.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/vllm_benchmark.yaml | Adds retry env knobs, uses shared docker-pull retry helper, and checks out aiter in the benchmark job to access the helper. |
| .github/workflows/flash_attention_integration.yaml | Adds retry env knobs and replaces direct docker pull with the shared retry helper in relevant jobs. |
| .github/workflows/atom-test.yaml | Adds retry env knobs and replaces a single docker pull with an inline retry loop (since it checks out the ATOM repo). |
| .github/scripts/docker_pull_with_retry.sh | New shared helper script implementing retry logic for Docker image pulls. |

Sync the PR with main after the docker login retry changes landed so the branch keeps the newer vLLM wheel-artifact workflow while preserving docker pull retries.
@leo-automation self-requested a review on April 30, 2026 15:03
Comment thread .github/workflows/vllm_benchmark.yaml Outdated

    steps:
      - name: Checkout aiter repo
        uses: actions/checkout@v4
Contributor

nit: we should be using actions/checkout@v6.
Also, check whether we can check out only the script itself instead of the whole repo; something like sparse-checkout: .github/scripts should do.

Use actions/checkout@v6 for the vLLM benchmark helper checkout and sparse-checkout only .github/scripts so the job does not clone the full repository just to access the docker pull retry helper.
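
For context, the sparse checkout described above roughly corresponds to the following git commands. This is an illustrative sketch, not how actions/checkout is implemented; `sparse_checkout_dir` is a hypothetical helper, and it assumes git 2.25 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: clone a repo but materialize only selected directories.
# Usage: sparse_checkout_dir REPO_URL DEST DIR [DIR...]
sparse_checkout_dir() {
  local repo_url="$1" dest="$2"
  shift 2
  # --sparse starts with only top-level files in the working tree.
  git clone --quiet --sparse "$repo_url" "$dest" || return 1
  # Limit the working tree to the requested directories
  # (cone mode also keeps top-level files).
  git -C "$dest" sparse-checkout set "$@"
}
```

Calling `sparse_checkout_dir <repo-url> aiter .github/scripts` would leave the helper script available under `aiter/.github/scripts` without populating the rest of the source tree.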
@gyohuangxin merged commit bf40536 into main on May 4, 2026
37 of 40 checks passed
@gyohuangxin deleted the ci/docker-pull-retry-main-20260430 branch on May 4, 2026 14:08
Liang-jianhao97 pushed a commit that referenced this pull request May 7, 2026
* CI: retry docker pulls in workflow image downloads

Retry image pulls in ATOM, vLLM, and flash attention workflows so transient registry failures do not fail CI immediately. Add a shared helper where the job checks out aiter and keep an inline retry for the ATOM job that checks out the ATOM repo.

* CI: narrow vLLM helper checkout scope

Use actions/checkout@v6 for the vLLM benchmark helper checkout and sparse-checkout only .github/scripts so the job does not clone the full repository just to access the docker pull retry helper.