[Bugfix] Suppress harmless repo_utils ERROR in stage workers for local model paths#1658

Closed
Lidang-Jiang wants to merge 3 commits into vllm-project:main from Lidang-Jiang:fix/suppress-repo-utils-error-in-stage-workers

Conversation

@Lidang-Jiang
Contributor

Summary

When serving models from local filesystem paths (e.g., /ssd1/models/...), each stage worker process logs a spurious ERROR during startup:

[Stage-X] ERROR [repo_utils.py:47] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/ssd1/models/...'

This happens because vllm.transformers_utils.repo_utils.file_or_path_exists() attempts to query the HuggingFace Hub for safetensors metadata before falling back to local filesystem checks. When the model argument is a local path (not a valid HF repo ID), the Hub query fails and logs an ERROR — even though the function then correctly falls back to local file detection.

Impact: Pure log noise. The model loads and serves correctly. However, in multi-stage deployments, these spurious ERRORs accumulate and make it harder for operators to spot real issues.

Fix: Set the vllm.transformers_utils.repo_utils logger level to CRITICAL in both _stage_worker() and _stage_worker_async() entry points, after plugin loading and before engine initialization.

Test plan

  • Start a multi-stage vllm-omni service with a local model path
  • Verify no repo_utils ERROR messages in logs
  • Verify model loads and serves correctly
  • Verify that actual critical errors from other loggers are still visible


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a9a29a4fd


Comment thread: vllm_omni/entrypoints/omni_stage.py (Outdated)
# Suppress harmless repo_utils ERROR for local model paths
import logging as _logging

_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)


P2: Gate repo_utils logger suppression behind local path check

This sets vllm.transformers_utils.repo_utils to CRITICAL unconditionally, so stage workers also suppress legitimate ERROR diagnostics when model is a Hugging Face repo ID (for example, auth/network/revision failures). In production remote-model deployments, those messages are often the only actionable signal before higher-level failures, so this change can mask real initialization issues rather than just removing local-path noise. Please apply this suppression only when the input is a local filesystem path (or filter only the specific known-harmless message).

Contributor Author


Good catch — updated in 7dbcf9d. Both _stage_worker() and _stage_worker_async() now guard the suppression with if _os.path.exists(model):, so HF repo IDs keep their normal logging (auth failures, network errors, etc.).

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 3a9a29a to 7dbcf9d on March 4, 2026 at 12:20
Collaborator

@lishunyang12 left a comment



_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)
# IMPORTANT: Ensure vLLM's internal multiprocessing workers (e.g., GPUARWorker /
# GPUARModelRunner) are spawned with a fork-safe method.
# Mooncake / gRPC / RDMA and CUDA/NCCL can deadlock under fork-with-threads.
Collaborator


setLevel(CRITICAL) suppresses all ERROR logs from the vllm.transformers_utils.repo_utils logger — that's pretty broad. Can you target just the specific message instead?

Contributor Author


Good point — updated in 9ebc167. Replaced setLevel(CRITICAL) with a logging.Filter that only suppresses the known-harmless "Error retrieving file list" message. All other ERROR logs (including retry failures) now pass through normally.
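For reference, the pattern described here can be sketched with a minimal `logging.Filter` (the class name, the handler wiring, and the exact message fragments below are illustrative, not necessarily the commit's actual code):

```python
import logging


class _RepoUtilsNoiseFilter(logging.Filter):
    """Drop only the known-harmless repo_utils messages; pass everything else."""

    SUPPRESSED_FRAGMENTS = (
        "Error retrieving file list",
        "Error retrieving safetensors",
    )

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; True lets it through.
        message = record.getMessage()
        return not any(f in message for f in self.SUPPRESSED_FRAGMENTS)


# Attach the filter to the noisy logger only; other loggers are untouched.
logger = logging.getLogger("vllm.transformers_utils.repo_utils")
logger.addFilter(_RepoUtilsNoiseFilter())
```

Unlike `setLevel(CRITICAL)`, a logger-level filter inspects each record individually, so unrelated ERROR records (auth failures, retry exhaustion) still propagate normally.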

Collaborator


Looks good now — the logging.Filter approach is much better than setLevel(CRITICAL). Only suppresses the known-harmless messages while keeping real errors visible. Thanks for the quick fix.

Contributor Author


Thanks for the review!

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 7dbcf9d to 9ebc167 on March 5, 2026 at 07:01
…l model paths

Only suppress repo_utils logger when model is a local path (os.path.exists).
For HF repo IDs, keep logging enabled so real errors (auth failures,
network issues) remain visible.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 9ebc167 to 69d235c on March 5, 2026 at 07:18
…ge init

When multiple stages initialize concurrently on different GPUs,
their get_open_port() calls can race (TOCTOU) and return the same
port, causing EADDRINUSE errors. Add a global file lock that
serializes engine initialization across all stages before the
existing per-device lock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang
Contributor Author

Lidang-Jiang commented Mar 5, 2026

Update

1. Filter approach addresses review feedback

The current implementation (commit 69d235c) uses a targeted logging.Filter guarded by os.path.exists(model):

  • Only activates for local model paths — HF remote models retain full error logging
  • Only suppresses two specific message fragments ("Error retrieving file list", "Error retrieving safetensors") — all other ERROR logs from transformers pass through normally
  • Verified with production logs: repo_utils ERROR messages are gone while all other logging remains intact

This fully addresses @lishunyang12's concern about setLevel(CRITICAL) being too broad.

2. New commit: Global file lock for EADDRINUSE fix

Added commit 725e41e — when multiple stages initialize concurrently on different GPUs, their get_open_port() calls can race (TOCTOU) and return the same port, causing EADDRINUSE. A global file lock now serializes engine initialization across all stages before the existing per-device lock.

Both fixes have been validated end-to-end on our production server.

tts.log
fixed_tts.log

Collaborator

@lishunyang12 left a comment


LGTM — the filter approach is clean and properly scoped. One minor thing: the filter class is defined twice (sync and async workers) — could be deduplicated at module level, but not a blocker.

@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 3761e09 to 2fb231a on March 12, 2026 at 14:18
Move the duplicated _RepoUtilsLocalPathFilter class and guard logic
from both _stage_worker() and _stage_worker_async() to a single
module-level definition with a _suppress_repo_utils_errors_for_local_path()
helper, replacing both inline definitions with a single call.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang force-pushed the fix/suppress-repo-utils-error-in-stage-workers branch from 2fb231a to f83a622 on March 12, 2026 at 14:32
@Lidang-Jiang
Contributor Author

@lishunyang12 Thanks for the suggestion! Deduplicated in f83a622 — moved _RepoUtilsLocalPathFilter and the guard logic to module level as a _suppress_repo_utils_errors_for_local_path() helper, replacing both inline definitions with a single call.

@lishunyang12
Collaborator

Nice, thanks for deduplicating that.

@lishunyang12
Collaborator

Have you tested?

@Lidang-Jiang
Contributor Author

After trying to rebase, I found that omni_stage.py was removed entirely in #1908 (Entrypoint Refactoring). The _stage_worker() and _stage_worker_async() functions this PR patched no longer exist.

The new AsyncOmniEngine architecture doesn't spawn separate stage worker processes, so the repeated repo_utils ERROR logs for local model paths should no longer occur.

I'll close this PR since the fix target no longer exists. If the issue resurfaces in the new architecture, I'll open a fresh PR against the current codebase.
