[BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak #30409
kitaekatt wants to merge 1 commit into
Conversation
Code Review
This pull request correctly addresses a potential semaphore leak with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. The use of a thread-safe, double-checked locking pattern is well-suited for this purpose. The changes are well-contained and the rationale is clearly documented. I have one suggestion to enhance robustness by adding an explicit check for a configuration conflict, which will improve error handling and user experience.
```python
if self.vllm_config.model_config.skip_tokenizer_init:
    self._tokenizer_initialized = True
    return
```
Using structured output features requires a tokenizer. If skip_tokenizer_init is True, the tokenizer is not loaded, which will lead to a crash with an unclear error message later when a backend (like XgrammarBackend or GuidanceBackend) is initialized with tokenizer=None.
While the original code also had this issue (it would raise an AttributeError), this pull request provides a good opportunity to make the behavior more robust. Failing early with a clear error message would significantly improve the user experience.
Suggested change:

```python
# Before:
if self.vllm_config.model_config.skip_tokenizer_init:
    self._tokenizer_initialized = True
    return

# After:
if self.vllm_config.model_config.skip_tokenizer_init:
    raise RuntimeError(
        "Structured output requires a tokenizer, but skip_tokenizer_init is True.")
This PR is a re-opening of #30284. The original branch was accidentally deleted, preventing that PR from being reopened.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a72d1f9 to 9a1b4ec
Force-pushed from 9a1b4ec to e0f3098
This pull request has merge conflicts that must be resolved before it can be merged.
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:

- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand
- Added an explicit RuntimeError when skip_tokenizer_init=True but structured output is requested, providing a clear error message instead of a later AttributeError

The fix resolves the following symptoms:

- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
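Illustratively, the call-site behavior those bullets describe, continuing the class sketched earlier; `_get_or_create_backend` and `compile_grammar` are hypothetical names for this sketch, not vLLM API:

```python
    def grammar_init(self, request):
        # Plain inference requests never call this, so the EngineCore
        # process skips tokenizer loading (and GGUF merge building) for
        # them entirely; the first structured-output request pays the
        # one-time load via the lazy property instead.
        tokenizer = self.tokenizer  # may raise RuntimeError when
                                    # skip_tokenizer_init=True
        backend = self._get_or_create_backend(tokenizer)  # hypothetical
        return backend.compile_grammar(request)  # hypothetical
```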
Force-pushed from e0f3098 to 200ef3c
@kitaekatt what's needed here to take it out of draft status?
Hey Russell. I will get this one open today. I want to do some more testing, but I should be able to complete it today. Sorry for the delay!
Testing performed: Tested with the following models on RTX 5090 (32GB):
Ran the models through a local benchmark runner for HumanEval. Verified code coverage by instrumenting the structured output path to confirm the changes are exercised during benchmark execution.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
reasoner_cls = ReasoningParserManager.get_reasoning_parser(reasoning_parser)
self.reasoner = reasoner_cls(tokenizer=self._tokenizer)

self._tokenizer_initialized = True
```
ThreadPoolExecutor leaked on initialization failure
Low Severity
In _init_tokenizer, the ThreadPoolExecutor is created at line 108 before calling cached_tokenizer_from_config and subsequent initialization steps. If any of these steps fail, _tokenizer_initialized remains False but self.executor holds a reference to the created executor. On retry (next access to self.tokenizer), a new ThreadPoolExecutor is created and assigned to self.executor, orphaning the previous one without proper shutdown. This can leak thread resources in scenarios with transient initialization failures.
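One way to avoid the leak, sketched against the hypothetical `_init_tokenizer` above: perform every fallible step before allocating the executor, so an exception followed by a retry never orphans a live pool. The `max_workers` value and ordering are illustrative, not the PR's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config


def _init_tokenizer(self):
    # Run everything that can raise *before* allocating the executor, so a
    # transient failure cannot orphan a live ThreadPoolExecutor on retry.
    tokenizer = cached_tokenizer_from_config(self.vllm_config.model_config)
    # ... any other fallible setup (reasoner, backend state) goes here ...

    # Allocate the resource that needs explicit shutdown last; the plain
    # attribute assignments below cannot fail, so nothing leaks on error.
    self.executor = ThreadPoolExecutor(max_workers=1)  # count illustrative
    self._tokenizer = tokenizer
    self._tokenizer_initialized = True
```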
This pull request has merge conflicts that must be resolved before it can be merged.
Closing in favor of a fresh PR rebased on current main. The original lazy-init approach is preserved but rewritten against the current StructuredOutputManager structure, which has changed significantly.
…ess mode

Defer tokenizer initialization in StructuredOutputManager from __init__ to first access via a property. When GGUF models are loaded in multiprocess mode, eager tokenizer init builds BPE merges using multiprocessing primitives that leak semaphores in forked subprocesses, eventually exhausting the system limit and causing server hangs.

Supersedes vllm-project#30409.

Signed-off-by: Christina <truffle@gmail.com>
Summary

Fixes semaphore exhaustion when repeatedly loading GGUF models by deferring tokenizer initialization.

Changes

- Defer tokenizer initialization in StructuredOutputManager from __init__ to first access via a property (see the sketch below).

Testing

Tested with repeated model loads; no further semaphore exhaustion.
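A minimal sketch of the deferral described above, assuming the same `cached_tokenizer_from_config` helper as earlier; this mirrors the commit message, not the merged diff:

```python
from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config


class StructuredOutputManager:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        self._tokenizer = None  # no longer loaded eagerly in __init__

    @property
    def tokenizer(self):
        # First access triggers the (potentially expensive) GGUF merge
        # build; processes that never serve structured output never pay
        # it, and no multiprocessing primitives are touched at fork time.
        if self._tokenizer is None:
            self._tokenizer = cached_tokenizer_from_config(
                self.vllm_config.model_config)
        return self._tokenizer
```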