[Bugfix] Lazy tokenizer init to prevent semaphore leak in multiprocess mode #33847
kitaekatt wants to merge 1 commit into
Code Review
This pull request correctly addresses a semaphore leak that occurs during GGUF model loading in multiprocess mode. By deferring the tokenizer initialization in StructuredOutputManager to a lazily-loaded @property, the eager creation of multiprocessing primitives within __init__ is avoided, which was the source of the leak. The reasoning parser's initialization is also correctly deferred, as it depends on the tokenizer. The implementation is clean, follows standard Python patterns for lazy initialization, and includes appropriate error handling for cases where structured output is attempted without a tokenizer. The changes are well-contained and effectively resolve the described issue.
…ess mode
Defer tokenizer initialization in StructuredOutputManager from __init__ to first access via a property. When GGUF models are loaded in multiprocess mode, eager tokenizer init builds BPE merges using multiprocessing primitives that leak semaphores in forked subprocesses, eventually exhausting the system limit and causing server hangs.
Supersedes vllm-project#30409.
Signed-off-by: Christina <truffle@gmail.com>
40d34cb to dcf8c8c
Closing: not pursuing this approach. The lazy tokenizer init idea isn't the right shape for fixing the semaphore leak; will explore alternatives separately if needed.
Summary
Supersedes #30409 (rebased and rewritten against current main).
Defers tokenizer initialization in StructuredOutputManager from __init__ to first access via a @property. This prevents semaphore exhaustion when GGUF models are loaded in multiprocess mode.
Problem
When StructuredOutputManager.__init__ eagerly calls cached_tokenizer_from_config(), the tokenizer builds BPE merges using multiprocessing primitives. In forked subprocesses (multiprocess model loading for GGUF), these primitives leak POSIX semaphores that aren't cleaned up, eventually exhausting the system limit (/proc/sys/kernel/sem) and causing the server to hang.
Changes
- Replace self.tokenizer = cached_tokenizer_from_config(...) with a @property that initializes on first access
- ThreadPoolExecutor for grammar compilation is still created eagerly (no multiprocessing primitives)
- skip_tokenizer_init check moved to the property, raising a clear error if structured output is used without a tokenizer
Testing
Tested with repeated GGUF model load/unload cycles in multiprocess mode — no semaphore exhaustion.
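For readers unfamiliar with the pattern, the approach listed under Changes can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class name `LazyTokenizerManager`, the `load_tokenizer` factory argument (standing in for `cached_tokenizer_from_config(...)`), and the error message are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyTokenizerManager:
    """Hypothetical stand-in for StructuredOutputManager (not vLLM code)."""

    def __init__(self, load_tokenizer, skip_tokenizer_init=False):
        # load_tokenizer is an assumed zero-argument factory that stands in
        # for cached_tokenizer_from_config(...). It is NOT called here.
        self._load_tokenizer = load_tokenizer
        self._skip_tokenizer_init = skip_tokenizer_init
        self._tokenizer = None
        # The thread pool is still created eagerly, as in the description.
        self.executor = ThreadPoolExecutor(max_workers=1)

    @property
    def tokenizer(self):
        # Build the tokenizer on first access and reuse it afterwards.
        if self._tokenizer is None:
            if self._skip_tokenizer_init:
                raise ValueError(
                    "Structured output requires a tokenizer, "
                    "but skip_tokenizer_init is set."
                )
            self._tokenizer = self._load_tokenizer()
        return self._tokenizer


calls = []


def fake_loader():
    # Records each invocation so the laziness is observable.
    calls.append("loaded")
    return object()


manager = LazyTokenizerManager(fake_loader)
assert calls == []  # constructing the manager did not load the tokenizer
first = manager.tokenizer
second = manager.tokenizer
assert first is second and calls == ["loaded"]  # loaded exactly once
manager.executor.shutdown()
```

The sketch does not guard against two threads hitting the property for the first time simultaneously; whether that matters depends on how the manager is accessed in practice.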