[BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak #30409
kitaekatt wants to merge 1 commit into
Conversation
Code Review
This pull request correctly addresses a potential semaphore leak with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. The use of a thread-safe, double-checked locking pattern is well-suited for this purpose. The changes are well-contained and the rationale is clearly documented. I have one suggestion to enhance robustness by adding an explicit check for a configuration conflict, which will improve error handling and user experience.
```python
if self.vllm_config.model_config.skip_tokenizer_init:
    self._tokenizer_initialized = True
    return
```
Using structured output features requires a tokenizer. If skip_tokenizer_init is True, the tokenizer is not loaded, which will lead to a crash with an unclear error message later when a backend (like XgrammarBackend or GuidanceBackend) is initialized with tokenizer=None.
While the original code also had this issue (it would raise an AttributeError), this pull request provides a good opportunity to make the behavior more robust. Failing early with a clear error message would significantly improve the user experience.
Suggested change:

```python
# Before:
if self.vllm_config.model_config.skip_tokenizer_init:
    self._tokenizer_initialized = True
    return

# After:
if self.vllm_config.model_config.skip_tokenizer_init:
    raise RuntimeError(
        "Structured output requires a tokenizer, but skip_tokenizer_init is True.")
This PR is a re-opening of #30284. The original branch was accidentally deleted, preventing that PR from being reopened.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a72d1f9 to 9a1b4ec
Force-pushed from 9a1b4ec to e0f3098
This pull request has merge conflicts that must be resolved before it can be merged.
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:

- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, the tokenizer is loaded on demand
- Added an explicit RuntimeError when skip_tokenizer_init=True but structured output is requested, providing a clear error message instead of a later AttributeError

The fix resolves the following symptoms:

- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
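Illustratively, the call-site behavior those bullets describe, continuing the class sketched earlier; `_get_or_create_backend` and `compile_grammar` are hypothetical names for this sketch, not vLLM API:

```python
    def grammar_init(self, request):
        # Plain inference requests never call this, so the EngineCore
        # process skips tokenizer loading (and GGUF merge building) for
        # them entirely; the first structured-output request pays the
        # one-time load via the lazy property instead.
        tokenizer = self.tokenizer  # may raise RuntimeError when
                                    # skip_tokenizer_init=True
        backend = self._get_or_create_backend(tokenizer)  # hypothetical
        return backend.compile_grammar(request)  # hypothetical
```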
Force-pushed from e0f3098 to 200ef3c
@kitaekatt what's needed here to take it out of draft status?
Hey Russell. I will get this one open today. I want to do some more testing, but I should be able to complete it today. Sorry for the delay!
Testing performed: Tested with the following models on RTX 5090 (32GB):
Ran the models through a local benchmark runner for HumanEval. Verified code coverage by instrumenting the structured output path to confirm the changes are exercised during benchmark execution.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
reasoner_cls = ReasoningParserManager.get_reasoning_parser(reasoning_parser)
self.reasoner = reasoner_cls(tokenizer=self._tokenizer)

self._tokenizer_initialized = True
```
ThreadPoolExecutor leaked on initialization failure
Low Severity
In _init_tokenizer, the ThreadPoolExecutor is created at line 108 before calling cached_tokenizer_from_config and subsequent initialization steps. If any of these steps fail, _tokenizer_initialized remains False but self.executor holds a reference to the created executor. On retry (next access to self.tokenizer), a new ThreadPoolExecutor is created and assigned to self.executor, orphaning the previous one without proper shutdown. This can leak thread resources in scenarios with transient initialization failures.
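One way to avoid the leak, sketched against the hypothetical `_init_tokenizer` above: perform every fallible step before allocating the executor, so an exception followed by a retry never orphans a live pool. The `max_workers` value and ordering are illustrative, not the PR's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config


def _init_tokenizer(self):
    # Run everything that can raise *before* allocating the executor, so a
    # transient failure cannot orphan a live ThreadPoolExecutor on retry.
    tokenizer = cached_tokenizer_from_config(self.vllm_config.model_config)
    # ... any other fallible setup (reasoner, backend state) goes here ...

    # Allocate the resource that needs explicit shutdown last; the plain
    # attribute assignments below cannot fail, so nothing leaks on error.
    self.executor = ThreadPoolExecutor(max_workers=1)  # count illustrative
    self._tokenizer = tokenizer
    self._tokenizer_initialized = True
```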
This pull request has merge conflicts that must be resolved before it can be merged.
Closing in favor of a fresh PR rebased on current main. The original lazy-init approach is preserved but rewritten against the current StructuredOutputManager structure, which has changed significantly.
…ess mode

Defer tokenizer initialization in StructuredOutputManager from __init__ to first access via a property. When GGUF models are loaded in multiprocess mode, eager tokenizer init builds BPE merges using multiprocessing primitives that leak semaphores in forked subprocesses, eventually exhausting the system limit and causing server hangs.

Supersedes vllm-project#30409.

Signed-off-by: Christina <truffle@gmail.com>
Summary

Fixes semaphore exhaustion when repeatedly loading GGUF models by deferring tokenizer initialization.

Changes

- Defer tokenizer initialization in StructuredOutputManager from __init__ to first access via a property (see the sketch below).

Testing

Tested with repeated model loads; no further semaphore exhaustion.
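A minimal sketch of the deferral described above, assuming the same `cached_tokenizer_from_config` helper as earlier; this mirrors the commit message, not the merged diff:

```python
from vllm.transformers_utils.tokenizer import cached_tokenizer_from_config


class StructuredOutputManager:
    def __init__(self, vllm_config):
        self.vllm_config = vllm_config
        self._tokenizer = None  # no longer loaded eagerly in __init__

    @property
    def tokenizer(self):
        # First access triggers the (potentially expensive) GGUF merge
        # build; processes that never serve structured output never pay
        # it, and no multiprocessing primitives are touched at fork time.
        if self._tokenizer is None:
            self._tokenizer = cached_tokenizer_from_config(
                self.vllm_config.model_config)
        return self._tokenizer
```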