# [Bugfix] Fix `RuntimeError: Already borrowed` by enforcing one tokenizer per thread and one thread in threadpool (#41047)
```diff
@@ -82,12 +82,13 @@ def __init__(self, config: "VllmConfig", tokenizer: _T | None) -> None:

         self.tokenizer = tokenizer

-        # Shared thread pool executor for blocking tokenizer and
-        # multimodal preprocessing operations. The multimodal processor
-        # receives a deep-copied tokenizer (see #36557) so it is safe to
-        # run tokenization and MM preprocessing concurrently.
-        pool_workers = config.model_config.renderer_num_workers
-        self._executor = ThreadPoolExecutor(max_workers=pool_workers)
+        # Shared single-worker thread pool for blocking tokenizer and
+        # multimodal preprocessing operations.
+        self._executor = ThreadPoolExecutor(max_workers=1)
+        # Tokenizer to be used in the executor thread.
+        # Deep copy to avoid sharing the tokenizer, leading to
+        # "already borrowed" errors (see #36557).
+        self.executor_tokenizer = copy.deepcopy(tokenizer)

         # Multimodal preprocessing is always offloaded to the thread pool
         # to keep the asyncio event loop responsive under concurrent load.
```
**Comment on lines +87 to +91**

**Collaborator:** Since we can use a deep copy to avoid sharing the tokenizer and prevent "already borrowed" errors (see #36557), we should find a way to make all tokenizers use a deep-copied instance (perhaps use a tokenizer pool), rather than enforcing one tokenizer per thread. We use multithreading in many places, and enforcing one tokenizer per thread here does not completely solve all the problems.
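For illustration, one shape such a pool could take is a bounded queue of deep copies that callers borrow per operation. This is a hypothetical sketch (`TokenizerPool` is not code from this PR or from the planned refactor):

```python
import copy
import queue
from contextlib import contextmanager


class TokenizerPool:
    """Hand out deep-copied tokenizers so no two threads share one."""

    def __init__(self, tokenizer, size: int) -> None:
        self._pool: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            # Each copy owns its own Rust backend, so concurrent use of
            # different copies cannot trigger "Already borrowed".
            self._pool.put(copy.deepcopy(tokenizer))

    @contextmanager
    def borrow(self):
        tok = self._pool.get()  # blocks until a copy is free
        try:
            yield tok
        finally:
            self._pool.put(tok)
```

A worker would then do `with pool.borrow() as tok: ids = tok.encode(text)`, so every in-flight encode holds a private copy for its duration.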
**Contributor (Author):** You're saying we want to create a deep copy of the tokenizer for each thread in the thread pool, rather than limiting the thread pool to one thread, right? I agree that this is the most flexible solution. I was referring to this in #40949.

I'm not sure I understand this. IIUC we use multithreading in many places (for the mm processor, pooling IO, applying chat templates, etc.), but they all execute on the same underlying thread pool.
**Collaborator:** This is the most flexible solution. +1

We are refactoring the preprocessing part so that we can place the thread pool and the tokenizer pool in one place. However, we don't yet know how many shared tokenizers are in use across the codebase that have so far gone unnoticed.
**Contributor (Author):** #41181 adds a thread-safety wrapper. I tried implementations with a mutex and with a thread-local copy; there was not a huge perf difference in my admittedly limited testing.
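The thread-local variant mentioned there might be sketched roughly as follows (`ThreadLocalTokenizer` is an illustrative name; see #41181 for the actual implementation):

```python
import copy
import threading


class ThreadLocalTokenizer:
    """Lazily deep-copy the tokenizer once per accessing thread."""

    def __init__(self, tokenizer) -> None:
        self._base = tokenizer
        self._local = threading.local()

    def get(self):
        # The first access from a given thread creates that thread's
        # private copy; later accesses reuse it, so each Rust backend
        # is only ever touched by a single thread.
        if not hasattr(self._local, "tokenizer"):
            self._local.tokenizer = copy.deepcopy(self._base)
        return self._local.tokenizer
```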
```diff
@@ -108,17 +109,14 @@ def __init__(self, config: "VllmConfig", tokenizer: _T | None) -> None:
         if config.model_config.is_multimodal_model:
             mm_processor_cache = mm_registry.processor_cache_from_config(config)

-            # Deep-copy the tokenizer so the multimodal processor gets its
-            # own Rust tokenizer backend. Without this, concurrent access
-            # from AsyncMicrobatchTokenizer and call_hf_processor causes
-            # "RuntimeError: Already borrowed" from the Rust RefCell.
-            # See: https://github.com/huggingface/tokenizers/issues/537
-            mm_tokenizer = copy.deepcopy(tokenizer)
+            # Cannot use self.executor_tokenizer because the mm processor
+            # might mutate the tokenizer, corrupting the shared tokenizer.
+            self.mm_tokenizer = copy.deepcopy(tokenizer)
```
**Contributor (Author):** Unfortunately, we can't use `self.executor_tokenizer` here. For example:

> vllm/vllm/transformers_utils/processors/deepseek_vl2.py, lines 100 to 115 in a608836
```diff
             with set_default_torch_num_threads():
                 self.mm_processor = mm_registry.create_processor(
                     config.model_config,
-                    tokenizer=mm_tokenizer,
+                    tokenizer=self.mm_tokenizer,
                     cache=mm_processor_cache,
                 )
```
|
|
||||||||||||||||||||||||||||||||||
|
|
```diff
@@ -130,11 +128,10 @@ def __init__(self, config: "VllmConfig", tokenizer: _T | None) -> None:
             # requests don't pollute the sender cache.
             ro_cache = mm_registry.processor_only_cache_from_config(config)
             if ro_cache is not None:
-                ro_tokenizer = copy.deepcopy(tokenizer)
                 with set_default_torch_num_threads():
                     self._readonly_mm_processor = mm_registry.create_processor(
                         config.model_config,
-                        tokenizer=ro_tokenizer,
+                        tokenizer=self.mm_tokenizer,
                         cache=ro_cache,
                     )
```
```diff
@@ -152,10 +149,19 @@ def get_tokenizer(self) -> _T:

         return tokenizer

+    def get_executor_tokenizer(self) -> _T:
+        tokenizer = self.executor_tokenizer
+        if tokenizer is None:
+            raise ValueError(
+                "Executor tokenizer not available when `skip_tokenizer_init=True`"
+            )
+
+        return tokenizer
+
     def get_async_tokenizer(self) -> AsyncMicrobatchTokenizer:
         if self._async_tokenizer is None:
             self._async_tokenizer = AsyncMicrobatchTokenizer(
-                self.get_tokenizer(), executor=self._executor
+                self.get_executor_tokenizer(), executor=self._executor
             )

         return self._async_tokenizer
```
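Taken together, the change confines the deep-copied tokenizer to the executor's single thread while the asyncio event loop only awaits results. In isolation, that pattern looks roughly like this generic sketch (not vLLM code; `tokenize_async` and its names are illustrative):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# max_workers=1 => every submitted call runs serially on one thread, so
# a tokenizer owned by that executor is never accessed concurrently.
_executor = ThreadPoolExecutor(max_workers=1)


async def tokenize_async(executor_tokenizer, text: str) -> list[int]:
    loop = asyncio.get_running_loop()
    # The event loop thread never calls into executor_tokenizer itself;
    # it only awaits work scheduled onto the executor's single thread.
    return await loop.run_in_executor(_executor, executor_tokenizer.encode, text)
```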
**Reviewer:** Removing `renderer_num_workers` from `EngineArgs` is a breaking change for the public Python API. Any user code that constructs `EngineArgs` with this parameter will now fail with a `TypeError`.

Additionally, removing the corresponding CLI argument (previously at line 836) breaks existing deployment scripts. While forcing a single worker fixes the borrowing issue, consider keeping the argument for backward compatibility (potentially ignoring it with a warning) and to allow for future scaling improvements (e.g., using a pool of tokenizers).
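The backward-compatible route suggested here is the standard accept-and-warn deprecation; a generic sketch under the assumption that the field simply becomes inert (`EngineArgsSketch` is illustrative, not vLLM's actual `EngineArgs`):

```python
import warnings
from dataclasses import dataclass


@dataclass
class EngineArgsSketch:
    # Accepted but ignored: the renderer executor is now fixed at one
    # worker, so this field exists only for backward compatibility.
    renderer_num_workers: int | None = None

    def __post_init__(self) -> None:
        if self.renderer_num_workers is not None:
            warnings.warn(
                "renderer_num_workers is deprecated and ignored; the "
                "renderer now uses a single-worker executor.",
                DeprecationWarning,
                stacklevel=2,
            )
```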
**Contributor (Author):** Good point. This is a draft for illustrative purposes; I will update it before marking it ready.