[Bugfix] Harden diffusion model prefetch against transformers v5 shard-resolution race #4076
- Introduced `from_pretrained_with_prefetch` to handle racy cache scenarios by re-prefetching and retrying on failures.
- Updated various model pipelines to utilize the new prefetching mechanism, ensuring robust loading of model components.
- Prefetch logic added to multiple models, including HunyuanVideo, LongCatImage, and QwenImage, to mitigate issues with incomplete cache states.
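The re-prefetch-and-retry behaviour described in the first bullet can be sketched roughly as follows. This is only an illustration of the pattern, not the PR's code: `load_with_heal`, its parameters, and the choice to treat `OSError` as the "half-written cache" signal are all assumptions made for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def load_with_heal(
    load: Callable[[], T],
    prefetch: Callable[[], None],
    max_reloads: int = 1,
) -> T:
    """Hypothetical helper: call load(); on OSError, re-prefetch and reload.

    `load` stands in for a `from_pretrained`-style call and `prefetch` for a
    snapshot download that repairs the local cache. Both are placeholders.
    """
    for attempt in range(max_reloads + 1):
        try:
            return load()
        except OSError:
            # Out of reload budget: surface the original error to the caller.
            if attempt == max_reloads:
                raise
            # Assume the cache was incomplete; refresh it before reloading.
            prefetch()
    raise AssertionError("unreachable")
```

A caller would pass two closures, one that loads a pipeline component and one that re-downloads the matching subfolder, so the heal step stays independent of any particular model class.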
…rity during prefetch attempts
@hsliuustc0106 Could you add a diffusion test tag so that we can check whether the diffusion nightly tests pass stably?
Skimmed the full diff. The core retry logic in One gap:
# writer's per-blob ``.lock`` and then returns a complete tree. So a bounded
# retry with linear backoff is what actually closes the window that a single
# best-effort attempt left open (Buildkite vllm-omni-rebase #1858: both the
# ``cuda_ti2v_hsdp`` missing-shard ``OSError`` and the ``wan_2_1_vace`` default
Consider making `_PREFETCH_MAX_ATTEMPTS` and `_PREFETCH_BACKOFF_BASE_S` configurable via environment variables for CI tuning.
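One possible shape for that suggestion, combined with the bounded linear-backoff retry the diff comment describes, is sketched below. The environment variable names `VLLM_OMNI_PREFETCH_MAX_ATTEMPTS` and `VLLM_OMNI_PREFETCH_BACKOFF_BASE_S`, the default values, and the helper names are all hypothetical; nothing here is taken from the PR itself.

```python
import os
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def read_positive(name: str, default: float, cast: Callable[[str], float]) -> float:
    """Read a positive number from the environment, else return the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = cast(raw)
    except ValueError:
        return default
    return value if value > 0 else default


# Hypothetical variable names and defaults, chosen only for this sketch.
MAX_ATTEMPTS = int(read_positive("VLLM_OMNI_PREFETCH_MAX_ATTEMPTS", 3, int))
BACKOFF_BASE_S = read_positive("VLLM_OMNI_PREFETCH_BACKOFF_BASE_S", 2.0, float)


def retry_with_linear_backoff(
    fn: Callable[[], T],
    max_attempts: int = MAX_ATTEMPTS,
    backoff_base_s: float = BACKOFF_BASE_S,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() up to max_attempts times, sleeping base * attempt between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Linear backoff: base, 2 * base, 3 * base, ...
            sleep(backoff_base_s * attempt)
    raise AssertionError("unreachable")
```

Falling back to the default on a malformed or non-positive value keeps a mistyped CI variable from disabling the retry entirely. Which exceptions count as retryable is left as a parameter here, since auth or gated-repo failures would presumably need to escalate rather than retry.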
The acc test failed.
This is a known issue in main: #4029
Summary
Cherry-picks the diffusion model-loading prefetch protection from `dev/vllm-align` into a standalone PR, since this hardening is now significant for CI stability.
- `vllm_omni/diffusion/model_loader/hub_prefetch.py` with a node-wide repo lock (`fcntl.flock` with an atomic dotfile-lock fallback for FSx/Lustre), bounded retry-with-backoff around `snapshot_download`, auth/gated-repo escalation, and a new `from_pretrained_with_prefetch` helper that re-prefetches and reloads to heal a half-written cache.
- `from_pretrained_with_prefetch` / `prefetch_subfolders` wired into 19 diffusion pipelines (Qwen-Image family, Wan2.2, LTX2, SD3, Flux2-Klein, HiDream, HunyuanVideo 1.5, LongCat, OmniGen2, Ovis, StableAudio).
Why
After the transformers v5 rebase, `cached_files` batch-resolves all shards up-front and raises `OSError: <repo> does not appear to have a file named ...` when a peer worker's shard set is still partially written. This is a latent race that surfaces under cold/partially-evicted shared HF caches in CI (Buildkite #1043 / #1858), crashing diffusion workers. The prefetch lock + retry/heal closes the window.
Test plan
`HF_HOME`
Made with Cursor