[Bugfix] harden Omni weight snapshot downloads#4139

Open
tzhouam wants to merge 2 commits into main from fix/omni-weight-snapshot-prefetch

@tzhouam
Collaborator

@tzhouam tzhouam commented Jun 4, 2026

Summary

Follow-up to #4076. That PR hardened diffusion pipeline prefetching, but Buildkite still shows Qwen3-Omni online serving failing during stage-0 startup because the generic Omni HF snapshot path can still expose a half-materialized shared cache to transformers.

  • Adds a repo-wide node lock around download_weights_from_hf_specific() using fcntl.flock with an atomic dotfile-lock fallback.
  • Retries transient incomplete snapshot downloads with backoff.
  • Verifies full-repo downloads include common metadata files before transformers tries to load tokenizers/processors/feature extractors.

Failure log: https://buildkite.com/vllm/vllm-omni/builds/11013/canvas?jid=019e90e1-a221-4fac-9049-9ac2492801ed&tab=output

Test plan

  • python -m py_compile vllm_omni/model_executor/model_loader/weight_utils.py
  • python -m ruff check vllm_omni/model_executor/model_loader/weight_utils.py
  • python -m ruff format --check vllm_omni/model_executor/model_loader/weight_utils.py
  • CI: rerun Buildkite qwen3 omni test from build #11013

Made with Cursor

Serialize full-repo Hugging Face snapshot materialization with a repo-wide
node lock and retry incomplete snapshots so CI does not race on shared caches.

Co-authored-by: Cursor <cursoragent@cursor.com>
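
The retry-and-verify behaviour described in the commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `download_with_retry`, `missing_metadata`, the `REQUIRED_METADATA` list, and the backoff parameters are hypothetical, and `download` stands for any zero-argument callable that returns a local snapshot directory.

```python
import os
import time

# Assumed list of "common metadata files"; the PR does not enumerate them.
REQUIRED_METADATA = ("config.json",)


def missing_metadata(snapshot_dir, required=REQUIRED_METADATA):
    """Return the required file names that are absent from snapshot_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(snapshot_dir, name))
    ]


def download_with_retry(download, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call download() until the returned snapshot passes the metadata check."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            snapshot_dir = download()
            missing = missing_metadata(snapshot_dir)
            if not missing:
                return snapshot_dir
            last_error = RuntimeError(f"incomplete snapshot, missing: {missing}")
        except OSError as exc:  # assumption: I/O errors are treated as transient
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(base_delay * 2**attempt)  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(
        f"snapshot still incomplete after {max_attempts} attempts"
    ) from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays. In practice the `download` callable might wrap `huggingface_hub.snapshot_download`, and the whole call would run while holding the repo-wide node lock.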
@tzhouam tzhouam requested a review from gcanlin as a code owner June 4, 2026 07:54
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@tzhouam tzhouam changed the title from "fix: harden Omni weight snapshot downloads" to "[Bugfix] harden Omni weight snapshot downloads" Jun 4, 2026
@tzhouam tzhouam added the ready label (label to trigger buildkite CI) Jun 4, 2026
@hsliuustc0106 hsliuustc0106 added the nightly-test label (label to trigger buildkite nightly test CI) Jun 4, 2026
@hsliuustc0106
Collaborator

The CI fails due to qwen-image & HY-image acc; please check whether the failures are related to this PR.

@Gaohan123 Gaohan123 removed the nightly-test (label to trigger buildkite nightly test CI) and ready (label to trigger buildkite CI) labels Jun 4, 2026
@hsliuustc0106
Collaborator

Please check the CI failures again and confirm whether they are unrelated to this PR.
