[Bugfix] harden diffusion model prefetch against transformers v5 shard-resolution race by tzhouam · Pull Request #4076 · vllm-project/vllm-omni

tzhouam · 2026-06-02T09:02:19Z

Summary

Cherry-picks the diffusion model-loading prefetch protection from dev/vllm-align into a standalone PR, since this hardening is now significant for CI stability.

Strengthens vllm_omni/diffusion/model_loader/hub_prefetch.py with a node-wide repo lock (fcntl.flock with an atomic dotfile-lock fallback for FSx/Lustre), bounded retry-with-backoff around snapshot_download, auth/gated-repo escalation, and a new from_pretrained_with_prefetch helper that re-prefetches and reloads to heal a half-written cache.
Wires from_pretrained_with_prefetch / prefetch_subfolders into 19 diffusion pipelines (Qwen-Image family, Wan2.2, LTX2, SD3, Flux2-Klein, HiDream, HunyuanVideo 1.5, LongCat, OmniGen2, Ovis, StableAudio).
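A rough sketch of the locking idea described above, not the code in hub_prefetch.py: try `fcntl.flock` first and fall back to an exclusively created dotfile when `flock` is unavailable or fails. The names `repo_lock`, `_acquire_dotfile`, `use_flock`, and the lock-file suffixes are hypothetical, and the fallback assumes exclusive create (`O_CREAT | O_EXCL`) is atomic on the target filesystem.

```python
import contextlib
import os
import time

try:
    import fcntl  # POSIX only
except ImportError:  # platforms without fcntl use the dotfile path
    fcntl = None


def _acquire_dotfile(path, poll_s, timeout_s):
    """Spin until `path` can be created exclusively, or time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {path}")
            time.sleep(poll_s)
        else:
            os.close(fd)
            return


@contextlib.contextmanager
def repo_lock(lock_dir, repo_id, poll_s=0.05, timeout_s=30.0, use_flock=True):
    """Hypothetical node-wide lock keyed by repo id; yields the mode used."""
    os.makedirs(lock_dir, exist_ok=True)
    base = os.path.join(lock_dir, repo_id.replace("/", "--"))
    fd = os.open(base + ".flock", os.O_CREAT | os.O_RDWR, 0o644)
    mode = None
    try:
        if use_flock and fcntl is not None:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
                mode = "flock"
            except OSError:
                pass  # filesystem rejected flock; fall through to the dotfile
        if mode is None:
            _acquire_dotfile(base + ".dotlock", poll_s, timeout_s)
            mode = "dotfile"
        yield mode
    finally:
        # Release only what this call actually acquired.
        if mode == "flock":
            fcntl.flock(fd, fcntl.LOCK_UN)
        elif mode == "dotfile":
            with contextlib.suppress(FileNotFoundError):
                os.unlink(base + ".dotlock")
        os.close(fd)
```

A caller would wrap the download in `with repo_lock(lock_dir, repo_id): ...` so that only one process per node materializes a given repo at a time. A dotfile left behind by a crashed holder is not handled in this sketch.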

Why

After the transformers v5 rebase, cached_files batch-resolves all shards up-front and raises OSError: <repo> does not appear to have a file named ... when a peer worker's shard set is still partially written. This is a latent race that surfaces under cold/partially-evicted shared HF caches in CI (Buildkite #1043 / #1858), crashing diffusion workers. The prefetch lock + retry/heal closes the window.
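The heal path can be illustrated with a small, library-free sketch. It assumes the load step and the prefetch step are passed in as plain callables; `load_with_heal` and its parameters are hypothetical names, not the signature of `from_pretrained_with_prefetch`, and the initial prefetch before the first load is also an assumption. Only the missing-file `OSError` quoted above is treated as healable.

```python
import time

# Substring of the OSError message quoted in the PR description.
_MISSING_FILE_MARKER = "does not appear to have a file named"


def load_with_heal(load, prefetch, max_attempts=3, backoff_base_s=1.0,
                   sleep=time.sleep):
    """Prefetch, then load; on a missing-file OSError, re-prefetch and retry.

    `load` and `prefetch` are zero-argument callables supplied by the caller
    (hypothetical stand-ins for a from_pretrained call and a snapshot
    download). Any other error, or the final failed attempt, is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    prefetch()
    for attempt in range(1, max_attempts + 1):
        try:
            return load()
        except OSError as exc:
            can_heal = _MISSING_FILE_MARKER in str(exc)
            if not can_heal or attempt == max_attempts:
                raise
            # Linear backoff: wait base * attempt, then refill the cache.
            sleep(backoff_base_s * attempt)
            prefetch()
```

Passing `sleep` as a parameter keeps the retry loop testable without real delays; errors that do not match the marker propagate immediately instead of being retried.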

Test plan

CI: diffusion e2e / multi-worker pipelines pass on a cold HF_HOME
No regression on warm-cache runs (prefetch is a near-noop)

Made with Cursor

- Introduced `from_pretrained_with_prefetch` to handle racy cache scenarios by re-prefetching and retrying on failures.
- Updated various model pipelines to utilize the new prefetching mechanism, ensuring robust loading of model components.
- Prefetch logic added to multiple models, including HunyuanVideo, LongCatImage, and QwenImage, to mitigate issues with incomplete cache states.

tzhouam · 2026-06-02T09:04:44Z

@hsliuustc0106 Could you add a diffusion test label so that we can check whether the diffusion nightly tests pass stably?

Gaohan123

LGTM. Thanks

hsliuustc0106 · 2026-06-02T15:44:32Z

Skimmed the full diff. The core retry logic in from_pretrained_with_prefetch and prefetch_subfolders looks solid — well-scoped healing, good can_heal guard, correct backoff.

One gap: vllm_omni/diffusion/models/z_image/pipeline_z_image.py uses prefetch_subfolders on main but its AutoTokenizer.from_pretrained(model, subfolder="tokenizer", ...) was not wrapped with from_pretrained_with_prefetch. Low risk (tokenizer is small, no shards), but inconsistent with the pattern everywhere else. Can be a follow-up.

hsliuustc0106 · 2026-06-02T16:28:53Z

+# writer's per-blob ``.lock`` and then returns a complete tree. So a bounded
+# retry with linear backoff is what actually closes the window that a single
+# best-effort attempt left open (Buildkite vllm-omni-rebase #1858: both the
+# ``cuda_ti2v_hsdp`` missing-shard ``OSError`` and the ``wan_2_1_vace`` default


Consider making _PREFETCH_MAX_ATTEMPTS and _PREFETCH_BACKOFF_BASE_S configurable via environment variables for CI tuning.
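One way this suggestion could look, as a sketch only: read both values from the environment and fall back to defaults on missing or invalid input. The environment variable names and the default values below are hypothetical placeholders; the PR does not state the actual defaults.

```python
import os

# Placeholder defaults; the real values in hub_prefetch.py are not shown here.
DEFAULT_MAX_ATTEMPTS = 3
DEFAULT_BACKOFF_BASE_S = 1.0

# Hypothetical variable names, chosen for illustration only.
ENV_MAX_ATTEMPTS = "VLLM_OMNI_PREFETCH_MAX_ATTEMPTS"
ENV_BACKOFF_BASE_S = "VLLM_OMNI_PREFETCH_BACKOFF_BASE_S"


def _parse(raw, cast, default, minimum):
    """Return cast(raw) if it parses and is >= minimum, else the default."""
    if raw is None:
        return default
    try:
        value = cast(raw)
    except ValueError:
        return default
    return value if value >= minimum else default


def read_prefetch_tuning(env=None):
    """Return (max_attempts, backoff_base_s) from `env` or os.environ."""
    env = os.environ if env is None else env
    max_attempts = _parse(env.get(ENV_MAX_ATTEMPTS), int,
                          DEFAULT_MAX_ATTEMPTS, 1)
    backoff_base_s = _parse(env.get(ENV_BACKOFF_BASE_S), float,
                            DEFAULT_BACKOFF_BASE_S, 0.0)
    return max_attempts, backoff_base_s
```

Falling back to the default on a malformed value, rather than raising, keeps a typo in a CI variable from breaking model loading; a stricter variant could log a warning instead.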

hsliuustc0106 · 2026-06-04T00:50:24Z

acc test failed

Gaohan123 · 2026-06-04T03:07:22Z

> acc test failed

This is a known issue on main: #4029

tzhouam added 2 commits June 2, 2026 09:01

fix: correct logging message format in hub_prefetch.py for better cla…

cb08af9

…rity during prefetch attempts

tzhouam requested review from Isotr0py, RuixiangMa, SamitHuang, ZJY0516, david6666666, princepride and wtomin as code owners June 2, 2026 09:02

tzhouam requested review from Gaohan123 and hsliuustc0106 June 2, 2026 09:05

tzhouam added the ready label (label to trigger buildkite CI) Jun 2, 2026

tzhouam changed the title ~~feat: harden diffusion model prefetch against transformers v5 shard-resolution race~~ [Bugfix] harden diffusion model prefetch against transformers v5 shard-resolution race Jun 2, 2026

Gaohan123 linked an issue Jun 2, 2026 that may be closed by this pull request

[Bug]: OSError: model does not appear to have a file named xx. Root Reason: HF shared cache concurrent shard materialization race under multi-worker/HSDP startup #3966

Closed

1 task

hsliuustc0106 added the diffusion-x2iat-test (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) and diffusion-x2v-test (label to trigger buildkite x2video series of diffusion models test in nightly CI) labels Jun 2, 2026

Gaohan123 added this to the v0.22.0 milestone Jun 2, 2026

Merge branch 'main' into feat/diffusion-prefetch-protection

69fd466

Gaohan123 approved these changes Jun 2, 2026

Gaohan123 enabled auto-merge (squash) June 2, 2026 15:30

hsliuustc0106 approved these changes Jun 2, 2026

hsliuustc0106 reviewed Jun 2, 2026

linyueqian mentioned this pull request Jun 3, 2026

[Bug][CI]: GLM-TTS merge-gate flake — SHA256 mismatch on lazy HF model download during test #4095

Open

Merge branch 'main' into feat/diffusion-prefetch-protection

f0758f5

Gaohan123 disabled auto-merge June 4, 2026 04:13

Gaohan123 merged commit af02713 into main Jun 4, 2026
8 of 10 checks passed

fhfuih mentioned this pull request Jun 4, 2026

[Bug]: Nightly / CI failed - Qwen-Image-Edit related L4 tests #4124

Open

1 task

tzhouam mentioned this pull request Jun 4, 2026

[Bugfix] harden Omni weight snapshot downloads #4139

Open

Gaohan123 merged 4 commits into main from feat/diffusion-prefetch-protection
