Skip to content

[Bugfix] harden diffusion model prefetch against transformers v5 shard-resolution race #4076

Merged
Gaohan123 merged 4 commits into main from feat/diffusion-prefetch-protection on Jun 4, 2026

Conversation

@tzhouam
Collaborator

@tzhouam tzhouam commented Jun 2, 2026

Summary

Cherry-picks the diffusion model-loading prefetch protection from dev/vllm-align into a standalone PR, since this hardening is now significant for CI stability.

  • Strengthens vllm_omni/diffusion/model_loader/hub_prefetch.py with a node-wide repo lock (fcntl.flock with an atomic dotfile-lock fallback for FSx/Lustre), bounded retry-with-backoff around snapshot_download, auth/gated-repo escalation, and a new from_pretrained_with_prefetch helper that re-prefetches and reloads to heal a half-written cache.
  • Wires from_pretrained_with_prefetch / prefetch_subfolders into 19 diffusion pipelines (Qwen-Image family, Wan2.2, LTX2, SD3, Flux2-Klein, HiDream, HunyuanVideo 1.5, LongCat, OmniGen2, Ovis, StableAudio).
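The locking and retry behavior listed above can be illustrated with a minimal sketch. This is not the code in hub_prefetch.py: `repo_lock`, `retry_with_backoff`, the `.excl` suffix, and all default values are hypothetical names chosen for illustration, and the auth/gated-repo escalation is left out. It assumes a POSIX host where `fcntl` is available, and that a filesystem without `flock` support reports `ENOLCK` or `EOPNOTSUPP`.

```python
import contextlib
import errno
import fcntl
import os
import time


@contextlib.contextmanager
def repo_lock(lock_path, poll_s=0.05):
    """Hold an exclusive lock on lock_path (hypothetical helper, not the PR's code)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    dotfile = None  # set only once the fallback lock is actually held
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        except OSError as exc:
            # Assumption: filesystems without flock support report one of these.
            if exc.errno not in (errno.ENOLCK, errno.EOPNOTSUPP):
                raise
            candidate = lock_path + ".excl"
            while True:
                try:
                    # O_CREAT | O_EXCL fails if the file exists, so creation is atomic.
                    os.close(os.open(candidate, os.O_CREAT | os.O_EXCL | os.O_WRONLY))
                    dotfile = candidate
                    break
                except FileExistsError:
                    time.sleep(poll_s)
        yield
    finally:
        if dotfile is not None:
            with contextlib.suppress(FileNotFoundError):
                os.unlink(dotfile)
        os.close(fd)  # closing the descriptor also releases any flock held on it


def retry_with_backoff(fn, max_attempts=3, base_s=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call fn up to max_attempts times, sleeping base_s * attempt between failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error to the caller
            sleep(base_s * attempt)  # linear backoff: base_s, 2 * base_s, ...
```

A caller would presumably hold the lock for the whole download, for example `with repo_lock(path): retry_with_backoff(lambda: snapshot_download(repo_id))`, so that peer workers on the same node wait for a complete tree instead of reading a partial one.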

Why

After the transformers v5 rebase, cached_files batch-resolves all shards up-front and raises OSError: <repo> does not appear to have a file named ... when a peer worker's shard set is still partially written. This is a latent race that surfaces under cold/partially-evicted shared HF caches in CI (Buildkite #1043 / #1858), crashing diffusion workers. The prefetch lock + retry/heal closes the window.
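The heal step can be sketched as follows. This is an illustration under assumptions, not the signature of `from_pretrained_with_prefetch`: `load_with_heal`, `looks_like_partial_cache`, and the `load` / `prefetch` callables are hypothetical, and matching on the error message quoted above is only one plausible way to recognize the race.

```python
MISSING_FILE_HINT = "does not appear to have a file named"


def looks_like_partial_cache(exc):
    """Heuristic: does this OSError match the missing-shard message?"""
    return isinstance(exc, OSError) and MISSING_FILE_HINT in str(exc)


def load_with_heal(load, prefetch, max_heals=1):
    """Call load(); on a missing-shard OSError, re-run prefetch() and reload.

    Hypothetical helper: `load` stands in for a from_pretrained call and
    `prefetch` for a locked, retried snapshot download.
    """
    heals = 0
    while True:
        try:
            return load()
        except OSError as exc:
            # Unrelated errors, or a failure after healing, propagate unchanged.
            if heals >= max_heals or not looks_like_partial_cache(exc):
                raise
            heals += 1
            prefetch()  # wait for / rebuild the complete shard set, then retry
```

Bounding the number of heals keeps a genuinely missing file from looping forever: the second failure is raised to the caller as-is.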

Test plan

  • CI: diffusion e2e / multi-worker pipelines pass on a cold HF_HOME
  • No regression on warm-cache runs (prefetch is a near-noop)

Made with Cursor

tzhouam added 2 commits June 2, 2026 09:01
- Introduced `from_pretrained_with_prefetch` to handle racy cache scenarios by re-prefetching and retrying on failures.
- Updated various model pipelines to utilize the new prefetching mechanism, ensuring robust loading of model components.
- Prefetch logic added to multiple models, including HunyuanVideo, LongCatImage, and QwenImage, to mitigate issues with incomplete cache states.
@tzhouam
Collaborator Author

tzhouam commented Jun 2, 2026

@hsliuustc0106 Could you add a diffusion test tag so that we can check whether the diffusion nightly tests pass reliably?

@tzhouam added the ready label (label to trigger buildkite CI) on Jun 2, 2026
@tzhouam changed the title from "feat: harden diffusion model prefetch against transformers v5 shard-resolution race" to "[Bugfix] harden diffusion model prefetch against transformers v5 shard-resolution race" on Jun 2, 2026
@hsliuustc0106 added the labels diffusion-x2iat-test (label to trigger buildkite x2image + x2audio + x2text series of diffusion models test in nightly CI) and diffusion-x2v-test (label to trigger buildkite x2video series of diffusion models test in nightly CI) on Jun 2, 2026
@Gaohan123 added this to the v0.22.0 milestone on Jun 2, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment

LGTM. Thanks

@Gaohan123 Gaohan123 enabled auto-merge (squash) June 2, 2026 15:30
@hsliuustc0106
Collaborator

Skimmed the full diff. The core retry logic in from_pretrained_with_prefetch and prefetch_subfolders looks solid — well-scoped healing, good can_heal guard, correct backoff.

One gap: vllm_omni/diffusion/models/z_image/pipeline_z_image.py uses prefetch_subfolders on main but its AutoTokenizer.from_pretrained(model, subfolder="tokenizer", ...) was not wrapped with from_pretrained_with_prefetch. Low risk (tokenizer is small, no shards), but inconsistent with the pattern everywhere else. Can be a follow-up.

# writer's per-blob ``.lock`` and then returns a complete tree. So a bounded
# retry with linear backoff is what actually closes the window that a single
# best-effort attempt left open (Buildkite vllm-omni-rebase #1858: both the
# ``cuda_ti2v_hsdp`` missing-shard ``OSError`` and the ``wan_2_1_vace`` default
Collaborator

Consider making _PREFETCH_MAX_ATTEMPTS and _PREFETCH_BACKOFF_BASE_S configurable via environment variables for CI tuning.
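One way this suggestion could look, as a sketch only: the environment variable names `VLLM_OMNI_PREFETCH_MAX_ATTEMPTS` and `VLLM_OMNI_PREFETCH_BACKOFF_BASE_S`, the helper names, and the default values `3` and `1.0` are all assumptions for illustration, not values taken from the PR.

```python
import os


def _env_int(name, default, minimum=1):
    """Read an int from the environment, falling back to default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return max(minimum, int(raw))
    except ValueError:
        return default


def _env_float(name, default, minimum=0.0):
    """Read a float from the environment, falling back to default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return max(minimum, float(raw))
    except ValueError:
        return default


def prefetch_settings():
    """Return (max_attempts, backoff_base_s); names and defaults are placeholders."""
    return (
        _env_int("VLLM_OMNI_PREFETCH_MAX_ATTEMPTS", 3),
        _env_float("VLLM_OMNI_PREFETCH_BACKOFF_BASE_S", 1.0),
    )
```

Clamping to a minimum keeps a mistyped CI value such as `0` from disabling the download attempt entirely.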

@hsliuustc0106
Collaborator

acc test failed

@Gaohan123
Collaborator

> acc test failed

This is a known issue on main: #4029

@Gaohan123 Gaohan123 disabled auto-merge June 4, 2026 04:13
@Gaohan123 Gaohan123 merged commit af02713 into main Jun 4, 2026
8 of 10 checks passed
3 participants