[Bugfix] Fix RuntimeError: Already borrowed by enforcing one tokenizer per thread and one thread in threadpool (#41047)
Conversation
Code Review
This pull request removes the renderer_num_workers configuration, restricting the renderer thread pool to a single worker to address thread-safety concerns with the tokenizer and multimodal processor cache. It introduces a deep-copied executor_tokenizer for use within the thread pool. Review feedback highlights that using the get_executor_tokenizer() helper during multimodal processor initialization will cause errors when skip_tokenizer_init=True is enabled, as the processor should handle a missing tokenizer gracefully. Furthermore, the removal of renderer_num_workers from EngineArgs and the CLI constitutes a breaking change for the public API and deployment scripts, suggesting that the parameter should be retained for backward compatibility.
    MultiModalConfig.mm_encoder_fp8_scale_save_margin
)
io_processor_plugin: str | None = None
renderer_num_workers: int = 1
Removing renderer_num_workers from EngineArgs is a breaking change for the public Python API. Any user code that constructs EngineArgs with this parameter will now fail with a TypeError.
Additionally, removing the corresponding CLI argument (previously at line 836) breaks existing deployment scripts. While forcing a single worker fixes the borrowing issue, consider keeping the argument for backward compatibility (potentially ignoring it with a warning) and to allow for future scaling improvements (e.g., using a pool of tokenizers).
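One way to keep the flag for backward compatibility while still forcing a single worker, as suggested above, would be a deprecated argument that warns and is ignored. The function names below are hypothetical, not vLLM's actual CLI plumbing; this is only a sketch of the suggestion:

```python
import argparse
import warnings


def add_renderer_args(parser: argparse.ArgumentParser) -> None:
    # Keep the flag so existing deployment scripts don't break with a
    # TypeError / argparse error when they pass it.
    parser.add_argument("--renderer-num-workers", type=int, default=1)


def resolve_renderer_num_workers(args: argparse.Namespace) -> int:
    # Ignore any value other than 1 with a deprecation warning instead
    # of spawning extra workers.
    if args.renderer_num_workers != 1:
        warnings.warn(
            "--renderer-num-workers is deprecated and ignored; the "
            "renderer executor is single-threaded to avoid 'Already "
            "borrowed' errors (see #40949).",
            DeprecationWarning,
        )
    return 1  # always force a single worker
```

This also leaves room for future scaling improvements (e.g. a tokenizer pool) to re-enable the flag without another CLI change.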
Good point. This is a draft for illustrative purposes, will update before readying
…zer per thread Signed-off-by: Yifan <yzong@redhat.com>
Force-pushed from 1e687a9 to cd7560d
Signed-off-by: Yifan <yzong@redhat.com>
self._executor = ThreadPoolExecutor(max_workers=1)
# Tokenizer to be used in the executor thread
# Deep copy to avoid sharing the tokenizer leading to
# "already borrowed" errors (see #36557).
self.executor_tokenizer = copy.deepcopy(tokenizer)
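The pattern in this diff can be sketched in isolation: a single-worker executor plus a deep-copied tokenizer means the main thread and the executor thread never touch the same tokenizer instance, so the Rust-backed tokenizer is never borrowed concurrently. `Renderer` below is a minimal stand-in, not the actual vLLM class:

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class Renderer:
    """Minimal sketch of the one-tokenizer-per-thread pattern."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer  # used on the main thread only
        self._executor = ThreadPoolExecutor(max_workers=1)
        # Deep copy so the executor thread never shares an instance
        # with the main thread ("already borrowed", see #36557).
        self.executor_tokenizer = copy.deepcopy(tokenizer)

    def encode_async(self, text):
        # All executor work uses the copy, never self.tokenizer.
        return self._executor.submit(self.executor_tokenizer.encode, text)
```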
Since we can use a deep copy to avoid sharing the tokenizer and prevent "already borrowed" errors (see #36557), we should find a way to make all tokenizers use a deep-copied instance (perhaps a tokenizer pool), rather than enforcing one tokenizer per thread.
We use multithreading in many places, and enforcing one tokenizer per thread here does not completely solve all the problems.
> we should find a way to make all tokenizers use a deep-copied instance, rather than enforcing one tokenizer per thread.
You're saying we want to create a deep copy of the tokenizer for each thread in the thread pool, rather than limiting the thread pool to one thread, right?
I agree that this is the most flexible solution. I was referring to this in #40949, under "Keep allowing --renderer-num-workers > 1 but use thread-local tokenizer". I'm working on a draft PR implementing this.
> We use multithreading in many places, and enforcing one tokenizer per thread here does not completely solve all the problems.
I'm not sure I understand this. IIUC we use multithreading in many places (for mm processor, pooling io, applying chat template, etc.) but they all execute on the same underlying thread pool in renderer._executor. If we ensure a unique tokenizer for each thread in the thread pool (either by having many copies of the tokenizer or by enforcing thread pool contains only 1 thread), won't we avoid all the problems?
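The per-thread-copy alternative discussed here could be sketched with a lazily populated `threading.local`: each pool thread gets its own deep copy of the base tokenizer on first use, so no two threads ever borrow the same instance. This is a hypothetical illustration, not code from this PR or the follow-up draft:

```python
import copy
import threading


class ThreadLocalTokenizer:
    """Sketch of 'one deep-copied tokenizer per pool thread'."""

    def __init__(self, base_tokenizer):
        self._base = base_tokenizer
        self._local = threading.local()

    def get(self):
        # Lazily deep-copy the base tokenizer the first time each
        # thread asks for one; later calls on the same thread reuse it.
        tok = getattr(self._local, "tok", None)
        if tok is None:
            tok = copy.deepcopy(self._base)
            self._local.tok = tok
        return tok
```

With this, a `--renderer-num-workers > 1` pool could submit `lambda: tl.get().encode(text)` and every worker thread would transparently use its own copy.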
> You're saying we want to create a deep-copy of the tokenizer for each thread in the threadpool rather than limiting threadpool to one thread right?
This is the most flexible solution. +1
> I'm not sure I understand this. IIUC we use multithreading in many places (for mm processor, pooling io, applying chat template, etc.) but they all execute on the same underlying thread pool in renderer._executor. If we ensure a unique tokenizer for each thread in the thread pool (either by having many copies of the tokenizer or by enforcing thread pool contains only 1 thread), won't we avoid all the problems?
We are refactoring the preprocessing part so that we can place the thread pool and tokenizer pool in one place. However, we don't yet know how many tokenizers are being used at large and have gone unnoticed.
#41181: Adds thread-safety wrapper. Tried implementation with mutex and thread-local copy. Not a huge perf difference in my admittedly limited testing
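The mutex variant mentioned for #41181 could look roughly like the wrapper below: serialize all access to one shared tokenizer with a lock instead of copying it per thread. This is a hypothetical sketch, not the actual #41181 implementation; it trades concurrency (encodes run one at a time) for memory and simplicity:

```python
import threading


class LockingTokenizer:
    """Serialize access to a single shared tokenizer with a mutex."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer
        self._lock = threading.Lock()

    def encode(self, text):
        # Only one thread may borrow the underlying tokenizer at a
        # time, so concurrent callers can never trigger
        # "Already borrowed".
        with self._lock:
            return self._tokenizer.encode(text)
```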
Changed the title from "RuntimeError: Already borrowed by enforcing one tokenizer per thread" to "RuntimeError: Already borrowed by enforcing one tokenizer per thread and one thread in threadpool"
Signed-off-by: Yifan <yzong@redhat.com>
mm_tokenizer = copy.deepcopy(tokenizer)
# Cannot use self.executor_tokenizer because the mm processor might
# mutate the tokenizer, corrupting the shared tokenizer.
self.mm_tokenizer = copy.deepcopy(tokenizer)
Unfortunately, we can't use self.executor_tokenizer here because the mm processor may mutate its tokenizer. These processors were not designed with thread-safety in mind.
For example, DeepseekVLV2Processor (vllm/transformers_utils/processors/deepseek_vl2.py, lines 100 to 115 at a608836)
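The hazard can be shown with a toy example: if a processor mutates the tokenizer it is handed at init (the way DeepseekVLV2Processor is cited as doing), every other user of a shared tokenizer sees that mutation. `FakeTokenizer` and `MutatingProcessor` below are illustrative stand-ins, not the real classes:

```python
import copy


class FakeTokenizer:
    """Stand-in for a HF tokenizer (illustrative only)."""

    def __init__(self):
        self.special_tokens = []

    def add_special_tokens(self, tokens):
        self.special_tokens.extend(tokens)


class MutatingProcessor:
    """Toy processor that mutates its tokenizer at init, the way the
    real mm processors cited above can."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Mutation: any other user of this tokenizer now sees new tokens.
        tokenizer.add_special_tokens(["<image>"])


shared = FakeTokenizer()
# Handing the processor a deep copy keeps the shared tokenizer pristine,
# which is why the diff gives the mm processor its own mm_tokenizer.
MutatingProcessor(copy.deepcopy(shared))
assert shared.special_tokens == []
```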
Purpose
Fixes #40949 by enforcing one tokenizer per thread: one tokenizer remains on the main thread, and a separate tokenizer copy is used by the renderer executor. This PR also forces the renderer executor to stay single-threaded by removing --renderer-num-workers, avoiding RuntimeError: Already borrowed from concurrent Hugging Face tokenizer access.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.