[Bugfix] harden Omni weight snapshot downloads by tzhouam · Pull Request #4139 · vllm-project/vllm-omni

tzhouam · 2026-06-04T07:54:38Z

Summary

Follow-up to #4076. That PR hardened diffusion pipeline prefetching, but Buildkite still shows Qwen3-Omni online serving failing during stage-0 startup because the generic Omni HF snapshot path can still expose a half-materialized shared cache to transformers.

Adds a repo-wide node lock around download_weights_from_hf_specific() using fcntl.flock with an atomic dotfile-lock fallback.
Retries transient incomplete snapshot downloads with backoff.
Verifies full-repo downloads include common metadata files before transformers tries to load tokenizers/processors/feature extractors.
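
For reviewers, the locking item above can be pictured with the following minimal sketch. It is not the code in this PR: `repo_node_lock`, its parameters, and the lock-file naming are hypothetical and chosen only for illustration. It assumes `fcntl.flock` is used where the `fcntl` module is available and an atomically created dotfile (`O_CREAT | O_EXCL`) otherwise.

```python
import contextlib
import os
import time

try:
    import fcntl  # POSIX only
except ImportError:  # platforms without fcntl, e.g. Windows
    fcntl = None


@contextlib.contextmanager
def repo_node_lock(lock_dir, repo_id, poll_s=0.1, timeout_s=600.0):
    """Hypothetical helper: hold one exclusive lock per repo on this node."""
    os.makedirs(lock_dir, exist_ok=True)
    name = repo_id.replace("/", "--")

    if fcntl is not None:
        # Preferred path: advisory lock on a persistent lock file.
        lock_path = os.path.join(lock_dir, f".{name}.flock")
        with open(lock_path, "a") as fh:
            fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until acquired
            try:
                yield lock_path
            finally:
                fcntl.flock(fh.fileno(), fcntl.LOCK_UN)
        return

    # Fallback path: creating the dotfile with O_EXCL is atomic, so only
    # one process can succeed; the others poll until it is removed.
    lock_path = os.path.join(lock_dir, f".{name}.lock")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"timed out waiting for {lock_path}")
            time.sleep(poll_s)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A caller would wrap the whole snapshot download in `with repo_node_lock(cache_dir, repo_id): ...` so that only one process per node materializes a given repo at a time. One trade-off of the dotfile fallback is that a crashed holder leaves a stale lock behind, which is why the sketch includes a timeout.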

Failure log: https://buildkite.com/vllm/vllm-omni/builds/11013/canvas?jid=019e90e1-a221-4fac-9049-9ac2492801ed&tab=output

Test plan

python -m py_compile vllm_omni/model_executor/model_loader/weight_utils.py
python -m ruff check vllm_omni/model_executor/model_loader/weight_utils.py
python -m ruff format --check vllm_omni/model_executor/model_loader/weight_utils.py
CI: rerun Buildkite qwen3 omni test from build #11013

Made with Cursor

Serialize full-repo Hugging Face snapshot materialization with a repo-wide node lock and retry incomplete snapshots so CI does not race on shared caches. Co-authored-by: Cursor <cursoragent@cursor.com>
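
The retry and verification behavior described in the summary could look roughly like the sketch below. Everything here is an assumption for illustration: the helper names, the exponential backoff schedule, and the list of required files (the PR only says "common metadata files") are not taken from the actual change. `download_fn` stands in for whatever call materializes the snapshot and returns its directory.

```python
import os
import time

# Assumed for illustration; the PR does not list the files it checks.
ASSUMED_METADATA_FILES = ("config.json",)


class IncompleteSnapshotError(RuntimeError):
    """Raised when a snapshot is still missing metadata after all retries."""


def missing_metadata(snapshot_dir, required=ASSUMED_METADATA_FILES):
    """Return the required file names that are absent from snapshot_dir."""
    # os.path.isfile follows symlinks, so a dangling link counts as missing.
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(snapshot_dir, name))
    ]


def download_with_retry(
    download_fn,
    max_attempts=3,
    base_delay_s=1.0,
    required=ASSUMED_METADATA_FILES,
    sleep=time.sleep,
):
    """Call download_fn until the snapshot passes the metadata check."""
    for attempt in range(1, max_attempts + 1):
        snapshot_dir = download_fn()
        missing = missing_metadata(snapshot_dir, required)
        if not missing:
            return snapshot_dir
        if attempt == max_attempts:
            raise IncompleteSnapshotError(
                f"{snapshot_dir} is missing {missing} after {attempt} attempts"
            )
        # Exponential backoff: base, 2*base, 4*base, ...
        sleep(base_delay_s * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the backoff testable without real delays. In practice the check would run while the node lock is still held, so that no other process observes the half-materialized snapshot between attempts.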

chatgpt-codex-connector · 2026-06-04T07:54:44Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

hsliuustc0106 · 2026-06-04T14:46:45Z

The CI fails due to qwen-image & HY-image acc; please check whether these failures are related to this PR.

hsliuustc0106 · 2026-06-04T22:19:22Z

Please check the CI failures again and confirm whether they are unrelated.

fix: harden omni weight snapshot downloads

930c317

tzhouam requested a review from gcanlin as a code owner June 4, 2026 07:54

tzhouam changed the title ~~fix: harden Omni weight snapshot downloads~~ [Bugfix] harden Omni weight snapshot downloads Jun 4, 2026

tzhouam requested review from Gaohan123 and hsliuustc0106 June 4, 2026 07:56

tzhouam added the ready (label to trigger buildkite CI) label Jun 4, 2026

hsliuustc0106 added the nightly-test (label to trigger buildkite nightly test CI) label Jun 4, 2026

Merge branch 'main' into fix/omni-weight-snapshot-prefetch

49c1025

Gaohan123 removed the nightly-test (label to trigger buildkite nightly test CI) and ready (label to trigger buildkite CI) labels Jun 4, 2026

tzhouam wants to merge 2 commits into main from fix/omni-weight-snapshot-prefetch

3 participants
