[Bugfix] Suppress harmless repo_utils ERROR in stage workers for local model paths #1658
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a9a29a4fd
# Suppress harmless repo_utils ERROR for local model paths
import logging as _logging

_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)
Gate repo_utils logger suppression behind local path check
This sets vllm.transformers_utils.repo_utils to CRITICAL unconditionally, so stage workers also suppress legitimate ERROR diagnostics when model is a Hugging Face repo ID (for example, auth/network/revision failures). In production remote-model deployments, those messages are often the only actionable signal before higher-level failures, so this change can mask real initialization issues rather than just removing local-path noise. Please apply this suppression only when the input is a local filesystem path (or filter only the specific known-harmless message).
Good catch — updated in 7dbcf9d. Both _stage_worker() and _stage_worker_async() now guard the suppression with if _os.path.exists(model):, so HF repo IDs keep their normal logging (auth failures, network errors, etc.).
force-pushed 3a9a29a to 7dbcf9d
_logging.getLogger("vllm.transformers_utils.repo_utils").setLevel(_logging.CRITICAL)
# IMPORTANT: Ensure vLLM's internal multiprocessing workers (e.g., GPUARWorker /
# GPUARModelRunner) are spawned with a fork-safe method.
# Mooncake / gRPC / RDMA and CUDA/NCCL can deadlock under fork-with-threads.
setLevel(CRITICAL) suppresses all ERROR logs from transformers — that's pretty broad. Can you target just the specific message instead?
Good point — updated in 9ebc167. Replaced setLevel(CRITICAL) with a logging.Filter that only suppresses the known-harmless "Error retrieving file list" message. All other ERROR logs (including retry failures) now pass through normally.
Looks good now — the logging.Filter approach is much better than setLevel(CRITICAL). Only suppresses the known-harmless messages while keeping real errors visible. Thanks for the quick fix.
Thanks for the review!
force-pushed 7dbcf9d to 9ebc167
…l model paths

Only suppress repo_utils logger when model is a local path (os.path.exists). For HF repo IDs, keep logging enabled so real errors (auth failures, network issues) remain visible.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
force-pushed 9ebc167 to 69d235c
…ge init

When multiple stages initialize concurrently on different GPUs, their get_open_port() calls can race (TOCTOU) and return the same port, causing EADDRINUSE errors. Add a global file lock that serializes engine initialization across all stages before the existing per-device lock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
Update

1. Filter approach addresses review feedback

The current implementation (commit …) … This fully addresses @lishunyang12's concern about …

2. New commit: Global file lock for EADDRINUSE fix

Added commit …

Both fixes have been validated end-to-end on our production server.
lishunyang12 left a comment
LGTM — the filter approach is clean and properly scoped. One minor thing: the filter class is defined twice (sync and async workers) — could be deduplicated at module level, but not a blocker.
force-pushed 3761e09 to 2fb231a
Move the duplicated _RepoUtilsLocalPathFilter class and guard logic from both _stage_worker() and _stage_worker_async() to a single module-level definition with a _suppress_repo_utils_errors_for_local_path() helper, replacing both inline definitions with a single call.

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
force-pushed 2fb231a to f83a622
@lishunyang12 Thanks for the suggestion! Deduplicated in f83a622 — moved the filter class and guard logic to a single module-level definition.
Nice, thanks for deduplicating that.
Have you tested?
After trying to rebase, I found that the fix target no longer exists: the code this PR patched has been removed in the new architecture. I'll close this PR. If the issue resurfaces in the new architecture, I'll open a fresh PR against the current codebase.
Summary
When serving models from local filesystem paths (e.g., /ssd1/models/...), each stage worker process logs a spurious ERROR during startup.

This happens because vllm.transformers_utils.repo_utils.file_or_path_exists() attempts to query the HuggingFace Hub for safetensors metadata before falling back to local filesystem checks. When the model argument is a local path (not a valid HF repo ID), the Hub query fails and logs an ERROR — even though the function then correctly falls back to local file detection.

Impact: Pure log noise. The model loads and serves correctly. However, in multi-stage deployments, these spurious ERRORs accumulate and make it harder for operators to spot real issues.
Fix: Set the vllm.transformers_utils.repo_utils logger level to CRITICAL in both _stage_worker() and _stage_worker_async() entry points, after plugin loading and before engine initialization.

Test plan

- Verified that spurious repo_utils ERROR messages no longer appear in logs