[Bugfix] Lazy tokenizer init to prevent semaphore leak in multiprocess mode #33847

Closed
kitaekatt wants to merge 1 commit into
vllm-project:main from
kitaekatt:fix/lazy-tokenizer-structured-output

@kitaekatt
Contributor

Summary

Supersedes #30409 (rebased and rewritten against current main).

Defers tokenizer initialization in StructuredOutputManager from __init__ to first access via a @property. This prevents semaphore exhaustion when GGUF models are loaded in multiprocess mode.

Problem

When StructuredOutputManager.__init__ eagerly calls cached_tokenizer_from_config(), the tokenizer builds BPE merges using multiprocessing primitives. In forked subprocesses (multiprocess model loading for GGUF), these primitives leak POSIX semaphores that aren't cleaned up, eventually exhausting the system limit (/proc/sys/kernel/sem) and causing the server to hang.

Changes

  • Replace the eager self.tokenizer = cached_tokenizer_from_config(...) assignment with a @property that initializes the tokenizer on first access
  • Defer reasoning parser init as well, since it depends on the tokenizer
  • Keep creating the ThreadPoolExecutor for grammar compilation eagerly (it uses no multiprocessing primitives)
  • Move the skip_tokenizer_init check into the property, so that using structured output without a tokenizer raises a clear error
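
For readers who want to see the shape of the change, here is a minimal sketch of the lazy-property pattern described in the list above. It is not the code from this PR: LazyTokenizerHolder, load_tokenizer, and fake_loader are hypothetical names used only for illustration, and the loader is injected so the example runs without vLLM installed.

```python
from typing import Any, Callable, Optional


class LazyTokenizerHolder:
    """Hypothetical stand-in for a manager that owns a tokenizer."""

    def __init__(
        self,
        load_tokenizer: Callable[[], Any],
        skip_tokenizer_init: bool = False,
    ) -> None:
        # Only store what is needed to build the tokenizer later;
        # nothing expensive happens in __init__.
        self._load_tokenizer = load_tokenizer
        self._skip_tokenizer_init = skip_tokenizer_init
        self._tokenizer: Optional[Any] = None

    @property
    def tokenizer(self) -> Any:
        # Build the tokenizer on first access and cache it afterwards.
        if self._tokenizer is None:
            if self._skip_tokenizer_init:
                raise ValueError(
                    "Structured output requires a tokenizer, but "
                    "skip_tokenizer_init is set."
                )
            self._tokenizer = self._load_tokenizer()
        return self._tokenizer


# Usage with a fake loader (an assumption for the demo, not a real API).
load_calls = []


def fake_loader() -> dict:
    load_calls.append(1)
    return {"name": "fake-tokenizer"}


holder = LazyTokenizerHolder(fake_loader)
constructed_without_loading = len(load_calls) == 0

first = holder.tokenizer   # triggers the one and only load
second = holder.tokenizer  # served from the cached attribute
```

Note that this sketch is not thread-safe: two threads hitting the property at the same time could both call the loader. Whether that matters depends on how the property is accessed in practice.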

Testing

Tested with repeated GGUF model load/unload cycles in multiprocess mode — no semaphore exhaustion.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request correctly addresses a semaphore leak that occurs during GGUF model loading in multiprocess mode. By deferring the tokenizer initialization in StructuredOutputManager to a lazily-loaded @property, the eager creation of multiprocessing primitives within __init__ is avoided, which was the source of the leak. The reasoning parser's initialization is also correctly deferred, as it depends on the tokenizer. The implementation is clean, follows standard Python patterns for lazy initialization, and includes appropriate error handling for cases where structured output is attempted without a tokenizer. The changes are well-contained and effectively resolve the described issue.

…ess mode

Defer tokenizer initialization in StructuredOutputManager from __init__
to first access via a property. When GGUF models are loaded in
multiprocess mode, eager tokenizer init builds BPE merges using
multiprocessing primitives that leak semaphores in forked subprocesses,
eventually exhausting the system limit and causing server hangs.

Supersedes vllm-project#30409.

Signed-off-by: Christina <truffle@gmail.com>
@kitaekatt kitaekatt force-pushed the fix/lazy-tokenizer-structured-output branch from 40d34cb to dcf8c8c Compare March 25, 2026 17:28
@kitaekatt
Contributor Author

Closing — not pursuing this approach. The lazy tokenizer init idea isn't the right shape for fixing the semaphore leak; will explore alternatives separately if needed.

@kitaekatt kitaekatt closed this Apr 18, 2026
@kitaekatt kitaekatt deleted the fix/lazy-tokenizer-structured-output branch April 18, 2026 16:51

Labels

bug (Something isn't working), structured-output, v1

Projects

Status: Done

1 participant