ci: bump Mac json-images timeout 30 -> 45 min (cache-miss path)#5475

Merged
danielhanchen merged 1 commit into main from ci-mac-json-images-bump-timeout on May 16, 2026

@danielhanchen
Member

Summary

Test plan

The `JSON, images` job in `studio-mac-inference-smoke.yml` (Job 3
of Mac Studio GGUF CI) downloads ~4 GB on a cache miss: the 3 GB
`gemma-4-E2B-it-UD-Q4_K_XL.gguf` plus the ~1 GB `mmproj-F16.gguf`. The 30 min
cap was tight even with `HF_HUB_ENABLE_HF_TRANSFER=1` and parallel
downloads, and timed out the cache-miss run on PR #5430 mid-download
(run 25950714888) before Studio install or the smoke assertions ran.

Once the actions/cache restore hits, the job comes in under 10 min,
so 45 min only costs runner time on the first run after a cache
key bump (v1->v2 was just bumped in #5459, which is what produced
this failure). Jobs 1 (openai-anthropic, 270M model) and 2
(tool-calling, ~1.5 GB model) are not bumped -- their 25 min cap
has been comfortable.
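
As a rough sanity check on the caps above, the sketch below computes the minimum sustained download rate needed to fetch ~4 GB within each timeout. It assumes decimal units (1 GB = 1000 MB) and that the whole cap is available for downloading, which overstates the real budget because Studio install and the smoke assertions share the same job. The helper name `min_rate_kb_per_s` is hypothetical and not part of the workflow.

```bash
#!/usr/bin/env bash
# Minimum sustained download rate in kB/s (decimal units, rounded down)
# needed to fetch SIZE_MB megabytes within MINUTES minutes.
# Hypothetical helper for illustration only.
min_rate_kb_per_s() {
  local size_mb="$1" minutes="$2"
  echo $(( size_mb * 1000 / (minutes * 60) ))
}

min_rate_kb_per_s 4000 30   # ~4 GB under the 30 min cap -> 2222
min_rate_kb_per_s 4000 45   # ~4 GB under the 45 min cap -> 1481
```

Under these assumptions the download needs roughly 2.2 MB/s to fit in 30 min and roughly 1.5 MB/s to fit in 45 min.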
@gemini-code-assist
Contributor

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.

@danielhanchen danielhanchen merged commit 2958446 into main May 16, 2026
7 of 9 checks passed
@danielhanchen danielhanchen deleted the ci-mac-json-images-bump-timeout branch May 16, 2026 03:52
Stanley00 pushed a commit to stanley-fork/unsloth that referenced this pull request May 16, 2026
…nslothai#5476)

Root cause of the Mac json-images 30 min timeout (run 25950714888 /
PR unslothai#5430): huggingface_hub>=1.15 deprecated `hf_transfer` and routes
every transfer through `hf-xet`. The CI step's unpinned
`pip install --upgrade huggingface_hub hf_transfer` jumped to 1.15.0
+ hf-xet 1.5.0, the 940 MB mmproj finished in ~21s, then the 3 GB
gemma-4 GGUF made it to ~46% and went completely silent for the
remaining 29 minutes -- no progress bytes, no error, no exit -- until
the job timeout fired.

This wraps every CI `hf download` in a new
`.github/scripts/hf-download-with-retry.sh`:

  * Drops the no-op `HF_HUB_ENABLE_HF_TRANSFER=1` prefix and the
    `hf_transfer` install (both are deprecated on 1.15+ and only
    emit a FutureWarning now).
  * Exports the hf-xet high-performance knobs Daniel asked for:
        HF_XET_HIGH_PERFORMANCE=1
        HF_XET_CHUNK_CACHE_SIZE_BYTES=0
        HF_XET_NUM_CONCURRENT_RANGE_GETS=64
        HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY=0
        HF_XET_CLIENT_READ_TIMEOUT=500
  * Watchdogs each attempt: if `hf download` has not exited after
    HF_DOWNLOAD_STALL_SECONDS (default 180s = 3 min), SIGTERM,
    sleep 2, SIGKILL, then loop. Retries are unbounded; the
    enclosing job's `timeout-minutes` is the real cap.
  * Optional 3rd positional `LOCAL_DIR` -- omitted lets `hf` use
    the default HF_HUB_CACHE, which is what the HF_HOME-priming
    jobs need.
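
The watchdog-and-retry behaviour listed above can be sketched roughly as follows. This is a minimal bash illustration written from the description in this commit message, not the contents of `.github/scripts/hf-download-with-retry.sh`. The function names `run_with_watchdog` and `retry_with_watchdog`, the `MAX_ATTEMPTS` bound, and the 124 return code are assumptions added for illustration; the environment variables and the 180 s default come from the list above.

```bash
#!/usr/bin/env bash
# Sketch only: function names, MAX_ATTEMPTS, and the 124 status are hypothetical.

# hf-xet knobs as listed in the commit message.
export HF_XET_HIGH_PERFORMANCE=1
export HF_XET_CHUNK_CACHE_SIZE_BYTES=0
export HF_XET_NUM_CONCURRENT_RANGE_GETS=64
export HF_XET_RECONSTRUCT_WRITE_SEQUENTIALLY=0
export HF_XET_CLIENT_READ_TIMEOUT=500

# Per-attempt wall-clock limit in seconds (default 180 s = 3 min).
STALL_SECONDS="${HF_DOWNLOAD_STALL_SECONDS:-180}"

# Run "$@" once in the background and poll it once per second.
# If it has not exited after $1 seconds: SIGTERM, sleep 2, SIGKILL.
# Returns the command's own exit status, or 124 if the watchdog fired.
run_with_watchdog() {
  local limit="$1"; shift
  "$@" &
  local pid=$!
  local waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      kill -TERM "$pid" 2>/dev/null || true
      sleep 2
      kill -KILL "$pid" 2>/dev/null || true
      wait "$pid" 2>/dev/null || true
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid"
}

# Retry until the command succeeds. Unbounded when MAX_ATTEMPTS is unset
# or 0, so the enclosing job's timeout is the real cap; a positive
# MAX_ATTEMPTS (hypothetical) bounds the loop for local experiments.
retry_with_watchdog() {
  local attempt=0
  local max="${MAX_ATTEMPTS:-0}"
  while :; do
    attempt=$((attempt + 1))
    if run_with_watchdog "$STALL_SECONDS" "$@"; then
      return 0
    fi
    echo "attempt $attempt failed or stalled; retrying" >&2
    if [ "$max" -gt 0 ] && [ "$attempt" -ge "$max" ]; then
      return 1
    fi
  done
}

# Example call shape (placeholders, not executed here):
#   retry_with_watchdog hf download <repo-id> <filename>
```

Polling with `kill -0` keeps the sketch free of a separate timer process that would otherwise have to be cleaned up after a successful attempt. It relies on bash reaping finished background jobs, so it is not written for plain POSIX `sh`.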

19 call sites migrated across mlx-ci.yml + 9 studio-*-smoke.yml
workflows. The inline `python -c "from huggingface_hub import
hf_hub_download; ..."` block in mlx-ci.yml is also routed through
the wrapper so every hf transfer in CI gets the same treatment.

Also reverts the json-images timeout 45 -> 30 from unslothai#5475: the bump
was masking this hang, not fixing it.