[Bugfix] Suppress harmless repo_utils ERROR in stage workers for local model paths#1658

Closed
Lidang-Jiang wants to merge 3 commits into vllm-project:main from Lidang-Jiang:fix/suppress-repo-utils-error-in-stage-workers

Conversation

@Lidang-Jiang
Contributor

Summary

When serving models from local filesystem paths (e.g., /ssd1/models/...), each stage worker process logs a spurious ERROR during startup:

[Stage-X] ERROR [repo_utils.py:47] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/ssd1/models/...'

This happens because vllm.transformers_utils.repo_utils.file_or_path_exists() attempts to query the HuggingFace Hub for safetensors metadata before falling back to local filesystem checks. When the model argument is a local path (not a valid HF repo ID), the Hub query fails and logs an ERROR — even though the function then correctly falls back to local file detection.

Impact: Pure log noise. The model loads and serves correctly. However, in multi-stage deployments, these spurious ERRORs accumulate and make it harder for operators to spot real issues.

Fix: Set the vllm.transformers_utils.repo_utils logger level to CRITICAL in both _stage_worker() and _stage_worker_async() entry points, after plugin loading and before engine initialization.

Test plan

  • Start a multi-stage vllm-omni service with a local model path
  • Verify no repo_utils ERROR messages in logs
  • Verify model loads and serves correctly
  • Verify that actual critical errors from other loggers are still visible


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a9a29a4fd


Comment thread: vllm_omni/entrypoints/omni_stage.py (Outdated)
# Suppress harmless repo_utils ERROR for local model paths
import logging as _logging

_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)


P2: Gate repo_utils logger suppression behind local path check

This sets vllm.transformers_utils.repo_utils to CRITICAL unconditionally, so stage workers also suppress legitimate ERROR diagnostics when model is a Hugging Face repo ID (for example, auth/network/revision failures). In production remote-model deployments, those messages are often the only actionable signal before higher-level failures, so this change can mask real initialization issues rather than just removing local-path noise. Please apply this suppression only when the input is a local filesystem path (or filter only the specific known-harmless message).

Contributor Author


Good catch — updated in 7dbcf9d. Both _stage_worker() and _stage_worker_async() now guard the suppression with if _os.path.exists(model):, so HF repo IDs keep their normal logging (auth failures, network errors, etc.).

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 3a9a29a to 7dbcf9d on March 4, 2026 at 12:20
Collaborator

@lishunyang12 left a comment



_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)
# IMPORTANT: Ensure vLLM's internal multiprocessing workers (e.g., GPUARWorker /
# GPUARModelRunner) are spawned with a fork-safe method.
# Mooncake / gRPC / RDMA and CUDA/NCCL can deadlock under fork-with-threads.
Collaborator


setLevel(CRITICAL) suppresses all ERROR logs from the vllm.transformers_utils.repo_utils logger — that's pretty broad. Can you target just the specific message instead?

Contributor Author


Good point — updated in 9ebc167. Replaced setLevel(CRITICAL) with a logging.Filter that only suppresses the known-harmless "Error retrieving file list" message. All other ERROR logs (including retry failures) now pass through normally.
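For reference, the pattern described here can be sketched with a minimal `logging.Filter` (the class name, the handler wiring, and the exact message fragments below are illustrative, not necessarily the commit's actual code):

```python
import logging


class _RepoUtilsNoiseFilter(logging.Filter):
    """Drop only the known-harmless repo_utils messages; pass everything else."""

    SUPPRESSED_FRAGMENTS = (
        "Error retrieving file list",
        "Error retrieving safetensors",
    )

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; True lets it through.
        message = record.getMessage()
        return not any(f in message for f in self.SUPPRESSED_FRAGMENTS)


# Attach the filter to the noisy logger only; other loggers are untouched.
logger = logging.getLogger("vllm.transformers_utils.repo_utils")
logger.addFilter(_RepoUtilsNoiseFilter())
```

Unlike `setLevel(CRITICAL)`, a logger-level filter inspects each record individually, so unrelated ERROR records (auth failures, retry exhaustion) still propagate normally.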

Collaborator


Looks good now — the logging.Filter approach is much better than setLevel(CRITICAL). Only suppresses the known-harmless messages while keeping real errors visible. Thanks for the quick fix.

Contributor Author


Thanks for the review!

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 7dbcf9d to 9ebc167 on March 5, 2026 at 07:01
…l model paths

Only suppress repo_utils logger when model is a local path (os.path.exists).
For HF repo IDs, keep logging enabled so real errors (auth failures,
network issues) remain visible.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 9ebc167 to 69d235c on March 5, 2026 at 07:18
…ge init

When multiple stages initialize concurrently on different GPUs,
their get_open_port() calls can race (TOCTOU) and return the same
port, causing EADDRINUSE errors. Add a global file lock that
serializes engine initialization across all stages before the
existing per-device lock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang
Contributor Author

Lidang-Jiang commented Mar 5, 2026

Update

1. Filter approach addresses review feedback

The current implementation (commit 69d235c) uses a targeted logging.Filter guarded by os.path.exists(model):

  • Only activates for local model paths — HF remote models retain full error logging
  • Only suppresses two specific message fragments ("Error retrieving file list", "Error retrieving safetensors") — all other ERROR logs from transformers pass through normally
  • Verified with production logs: repo_utils ERROR messages are gone while all other logging remains intact

This fully addresses @lishunyang12's concern about setLevel(CRITICAL) being too broad.

2. New commit: Global file lock for EADDRINUSE fix

Added commit 725e41e — when multiple stages initialize concurrently on different GPUs, their get_open_port() calls can race (TOCTOU) and return the same port, causing EADDRINUSE. A global file lock now serializes engine initialization across all stages before the existing per-device lock.

Both fixes have been validated end-to-end on our production server.

tts.log
fixed_tts.log

Collaborator

@lishunyang12 left a comment


LGTM — the filter approach is clean and properly scoped. One minor thing: the filter class is defined twice (sync and async workers) — could be deduplicated at module level, but not a blocker.

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 3761e09 to 2fb231a on March 12, 2026 at 14:18
Move the duplicated _RepoUtilsLocalPathFilter class and guard logic
from both _stage_worker() and _stage_worker_async() to a single
module-level definition with a _suppress_repo_utils_errors_for_local_path()
helper, replacing both inline definitions with a single call.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 2fb231a to f83a622 on March 12, 2026 at 14:32
@Lidang-Jiang
Contributor Author

@lishunyang12 Thanks for the suggestion! Deduplicated in f83a622 — moved _RepoUtilsLocalPathFilter and the guard logic to module level as a _suppress_repo_utils_errors_for_local_path() helper, replacing both inline definitions with a single call.

@lishunyang12
Collaborator

Nice, thanks for deduplicating that.

@lishunyang12
Collaborator

Have you tested?

@Lidang-Jiang
Contributor Author

After trying to rebase, I found that omni_stage.py was removed entirely in #1908 (Entrypoint Refactoring). The _stage_worker() and _stage_worker_async() functions this PR patched no longer exist.

The new AsyncOmniEngine architecture doesn't spawn separate stage worker processes, so the repeated repo_utils ERROR logs for local model paths should no longer occur.

I'll close this PR since the fix target no longer exists. If the issue resurfaces in the new architecture, I'll open a fresh PR against the current codebase.
