
[BugFix] Lazy tokenizer init in StructuredOutputManager to prevent GGUF semaphore leak #30409

Closed

kitaekatt wants to merge 1 commit into vllm-project:main from kitaekatt:fix/30284-tokenizer-semaphore-leak


Conversation

@kitaekatt
Contributor

Summary

Fixes semaphore exhaustion when repeatedly loading GGUF models by deferring tokenizer initialization.

Changes

  • Lazy initialization of tokenizer in StructuredOutputManager
  • Prevents resource leak during model loading/unloading cycles

Testing

Tested with repeated model loads; no more semaphore exhaustion.

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@gemini-code-assist Bot left a comment


Code Review

This pull request correctly addresses a potential semaphore leak with GGUF models by implementing lazy initialization for the tokenizer in StructuredOutputManager. The use of a thread-safe, double-checked locking pattern is well-suited for this purpose. The changes are well-contained and the rationale is clearly documented. I have one suggestion to enhance robustness by adding an explicit check for a configuration conflict, which will improve error handling and user experience.
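For readers unfamiliar with the pattern the review refers to, here is a minimal sketch of thread-safe, double-checked lazy initialization in Python. The class and attribute names are illustrative stand-ins, not the actual vLLM code:

    import threading

    class LazyTokenizerHolder:
        """Illustrative sketch: defer an expensive tokenizer load until first use."""

        def __init__(self, load_fn):
            self._load_fn = load_fn            # e.g. a cached tokenizer factory
            self._tokenizer = None
            self._initialized = False
            self._init_lock = threading.Lock()

        @property
        def tokenizer(self):
            # Fast path: skip the lock entirely once initialization is done.
            if not self._initialized:
                with self._init_lock:
                    # Re-check inside the lock: another thread may have finished
                    # initialization while we were waiting to acquire it.
                    if not self._initialized:
                        self._tokenizer = self._load_fn()
                        self._initialized = True
            return self._tokenizer

Because the flag is only set after the tokenizer has been assigned, a reader taking the fast path never observes a half-initialized object.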

Comment thread: vllm/v1/structured_output/__init__.py (Outdated)
Comment on lines +95 to +97
    if self.vllm_config.model_config.skip_tokenizer_init:
        self._tokenizer_initialized = True
        return

High severity

Using structured output features requires a tokenizer. If skip_tokenizer_init is True, the tokenizer is not loaded, which will lead to a crash with an unclear error message later when a backend (like XgrammarBackend or GuidanceBackend) is initialized with tokenizer=None.

While the original code also had this issue (it would raise an AttributeError), this pull request provides a good opportunity to make the behavior more robust. Failing early with a clear error message would significantly improve the user experience.

Suggested change

    # Before:
    if self.vllm_config.model_config.skip_tokenizer_init:
        self._tokenizer_initialized = True
        return

    # After:
    if self.vllm_config.model_config.skip_tokenizer_init:
        raise RuntimeError(
            "Structured output requires a tokenizer, but skip_tokenizer_init is True.")

@kitaekatt
Contributor Author

This PR is a re-opening of #30284. The original branch was accidentally deleted, preventing that PR from being reopened.

@kitaekatt kitaekatt marked this pull request as draft December 10, 2025 18:29
@mergify
Contributor

mergify Bot commented Dec 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify
Contributor

mergify Bot commented Jan 2, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jan 2, 2026
@mergify mergify Bot added the bug Something isn't working label Jan 14, 2026
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand
- Added explicit RuntimeError when skip_tokenizer_init=True but structured
  output is requested, providing clear error messaging instead of a later
  AttributeError

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
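For illustration, a rough sketch of the deferred flow the commit message above describes. The method names (grammar_init, the skip_tokenizer_init guard) follow that description, but the class, arguments, and bodies are simplified stand-ins rather than the actual StructuredOutputManager implementation:

    class StructuredOutputManagerSketch:
        """Simplified stand-in for the lazy-init behaviour described above."""

        def __init__(self, model_config, tokenizer_loader):
            self.model_config = model_config
            self._tokenizer_loader = tokenizer_loader  # e.g. a cached tokenizer factory
            self._tokenizer = None
            self._tokenizer_initialized = False        # nothing is loaded in __init__

        def _init_tokenizer(self):
            if self._tokenizer_initialized:
                return
            if self.model_config.skip_tokenizer_init:
                # Fail early with a clear message instead of a later AttributeError.
                raise RuntimeError(
                    "Structured output requires a tokenizer, but "
                    "skip_tokenizer_init is True.")
            # For GGUF models this is where merges get built on the fly; it now
            # runs only when a structured-output request actually arrives.
            self._tokenizer = self._tokenizer_loader()
            self._tokenizer_initialized = True

        def grammar_init(self, request):
            # The first structured-output request triggers the one-time tokenizer load.
            self._init_tokenizer()
            # ... grammar/backend setup using self._tokenizer would follow here ...

Plain inference requests never call grammar_init, so the EngineCore subprocess never touches the tokenizer or the multiprocessing primitives behind it.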
@kitaekatt kitaekatt force-pushed the fix/30284-tokenizer-semaphore-leak branch from e0f3098 to 200ef3c Compare January 19, 2026 17:26
@mergify mergify Bot removed the needs-rebase label Jan 19, 2026
@russellb
Member

@kitaekatt what's needed here to take it out of draft status?

@kitaekatt
Contributor Author

> @kitaekatt what's needed here to take it out of draft status?

Hey Russell. I'll get this one out of draft today. I want to do some more testing, but I should be able to finish that today. Sorry for the delay!

@kitaekatt
Contributor Author

Testing performed:

Tested with the following models on RTX 5090 (32GB):

  • google/gemma-3-1b-it
  • microsoft/Phi-3.5-mini-instruct
  • casperhansen/mistral-nemo-instruct-2407-awq

Ran models through a local benchmark runner for HumanEval. Verified code coverage by instrumenting the structured output path to confirm the changes are exercised during benchmark execution.

@kitaekatt kitaekatt marked this pull request as ready for review January 21, 2026 00:17

@cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



    reasoner_cls = ReasoningParserManager.get_reasoning_parser(reasoning_parser)
    self.reasoner = reasoner_cls(tokenizer=self._tokenizer)

    self._tokenizer_initialized = True

ThreadPoolExecutor leaked on initialization failure

Low Severity

In _init_tokenizer, the ThreadPoolExecutor is created at line 108 before calling cached_tokenizer_from_config and subsequent initialization steps. If any of these steps fail, _tokenizer_initialized remains False but self.executor holds a reference to the created executor. On retry (next access to self.tokenizer), a new ThreadPoolExecutor is created and assigned to self.executor, orphaning the previous one without proper shutdown. This can leak thread resources in scenarios with transient initialization failures.
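One way the ordering could be tightened, sketched under the assumption that the initializer creates both the executor and the tokenizer. Here load_tokenizer and build_reasoner are hypothetical placeholders for the real calls, not vLLM APIs:

    from concurrent.futures import ThreadPoolExecutor

    def init_structured_output_state(load_tokenizer, build_reasoner):
        """Sketch: run the fallible tokenizer load before creating the executor,
        and shut the executor down if a later step fails, so nothing is orphaned."""
        tokenizer = load_tokenizer()                 # may raise; nothing created yet
        executor = ThreadPoolExecutor(max_workers=1)
        try:
            reasoner = build_reasoner(tokenizer)     # remaining fallible step
        except Exception:
            executor.shutdown(wait=False)            # clean up instead of orphaning the pool
            raise
        return tokenizer, executor, reasoner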


@mergify
Contributor

mergify Bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jan 27, 2026
@kitaekatt
Contributor Author

Closing in favor of a fresh PR rebased on current main. The original lazy-init approach is preserved but rewritten against the current StructuredOutputManager structure, which has changed significantly.

@kitaekatt kitaekatt closed this Feb 5, 2026
kitaekatt added a commit to kitaekatt/vllm that referenced this pull request Mar 25, 2026
…ess mode

Defer tokenizer initialization in StructuredOutputManager from __init__
to first access via a property. When GGUF models are loaded in
multiprocess mode, eager tokenizer init builds BPE merges using
multiprocessing primitives that leak semaphores in forked subprocesses,
eventually exhausting the system limit and causing server hangs.

Supersedes vllm-project#30409.

Signed-off-by: Christina <truffle@gmail.com>