[Bugfix] Fix RuntimeError: Already borrowed by enforcing one tokenizer per thread and one thread in threadpool#41047

Draft
yzong-rh wants to merge 4 commits intovllm-project:mainfrom
yzong-rh:yzong-rh/executor_tokenizer

Conversation

@yzong-rh
Contributor

@yzong-rh yzong-rh commented Apr 27, 2026

Purpose

Fixes #40949 by enforcing one tokenizer per thread: one tokenizer remains on the main thread, and a separate tokenizer copy is used by the renderer executor. This PR also forces the renderer executor to stay single-threaded by removing --renderer-num-workers, avoiding RuntimeError: Already borrowed from concurrent Hugging Face tokenizer access.
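For context, a rough reproducer of the underlying error (illustrative only, not part of this PR; assumes any Hugging Face fast tokenizer, e.g. gpt2):

# Illustrative reproducer (not from this PR): sharing a single HF fast
# tokenizer across threads can intermittently raise
# "RuntimeError: Already borrowed", because concurrent calls mutate the
# Rust tokenizer's internal state (e.g. truncation settings).
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer

def encode(text: str):
    return tokenizer(text, truncation=True, max_length=8)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(encode, ["hello world"] * 1000))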

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added mistral Related to Mistral models bug Something isn't working labels Apr 27, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request removes the renderer_num_workers configuration, restricting the renderer thread pool to a single worker to address thread-safety concerns with the tokenizer and multimodal processor cache. It introduces a deep-copied executor_tokenizer for use within the thread pool. Review feedback highlights that using the get_executor_tokenizer() helper during multimodal processor initialization will cause errors when skip_tokenizer_init=True is enabled, as the processor should handle a missing tokenizer gracefully. Furthermore, the removal of renderer_num_workers from EngineArgs and the CLI constitutes a breaking change for the public API and deployment scripts, suggesting that the parameter should be retained for backward compatibility.

Comment thread vllm/renderers/base.py Outdated
Comment thread vllm/renderers/base.py Outdated
Comment thread vllm/engine/arg_utils.py
MultiModalConfig.mm_encoder_fp8_scale_save_margin
)
io_processor_plugin: str | None = None
renderer_num_workers: int = 1
Contributor


high

Removing renderer_num_workers from EngineArgs is a breaking change for the public Python API. Any user code that constructs EngineArgs with this parameter will now fail with a TypeError.

Additionally, removing the corresponding CLI argument (previously at line 836) breaks existing deployment scripts. While forcing a single worker fixes the borrowing issue, consider keeping the argument for backward compatibility (potentially ignoring it with a warning) and to allow for future scaling improvements (e.g., using a pool of tokenizers).
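A sketch of the backward-compatible option (illustrative only; the simplified class below is a stand-in, not the real EngineArgs):

# Illustrative sketch: keep accepting renderer_num_workers but force it to 1
# with a warning, instead of removing the field outright.
import warnings
from dataclasses import dataclass

@dataclass
class EngineArgs:  # simplified stand-in for the real class
    renderer_num_workers: int = 1

    def __post_init__(self):
        if self.renderer_num_workers != 1:
            warnings.warn(
                "renderer_num_workers is deprecated and forced to 1 to avoid "
                "concurrent tokenizer access ('Already borrowed').",
                DeprecationWarning,
                stacklevel=2,
            )
            self.renderer_num_workers = 1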

Contributor Author


Good point. This is a draft for illustrative purposes; I will update it before marking it ready.

…zer per thread

Signed-off-by: Yifan <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
Comment thread vllm/renderers/base.py
Comment on lines +87 to +91
self._executor = ThreadPoolExecutor(max_workers=1)
# Tokenizer to be used in the executor thread
# Deep copy to avoid sharing the tokenizer leading to
# "already borrowed" errors (see #36557).
self.executor_tokenizer = copy.deepcopy(tokenizer)
Collaborator


Since we can use a deep copy to avoid sharing the tokenizer and prevent "already borrowed" errors (see #36557), we should find a way to make all tokenizers use a deep-copied instance (perhaps using a tokenizer pool), rather than enforcing one tokenizer per thread.

We use multithreading in many places, and enforcing one tokenizer per thread here does not completely solve all the problems.
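A rough sketch of what a tokenizer pool could look like (illustrative only, not an actual implementation):

# Illustrative sketch of a tokenizer pool: a fixed number of deep-copied
# tokenizers, checked out per call regardless of which thread runs it.
import copy
import queue
from contextlib import contextmanager

class TokenizerPool:
    def __init__(self, tokenizer, size: int = 4):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(copy.deepcopy(tokenizer))

    @contextmanager
    def acquire(self):
        tok = self._pool.get()  # blocks until a copy is free
        try:
            yield tok
        finally:
            self._pool.put(tok)

# Usage:
# with pool.acquire() as tok:
#     tok(prompt)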

Contributor Author


we should find a way to make all tokenizers use a deep-copied instance, rather than enforcing one tokenizer per thread.

You're saying we want to create a deep copy of the tokenizer for each thread in the threadpool, rather than limiting the threadpool to one thread, right?

I agree that this is the most flexible solution. I was referring to this in #40949 under "Keep allowing --renderer-num-workers > 1 but use thread-local tokenizer". I'm working on a draft PR implementing this.

We use multithreading in many places, and enforcing one tokenizer per thread here does not completely solve all the problems.

I'm not sure I understand this. IIUC we use multithreading in many places (for the mm processor, pooling IO, applying chat templates, etc.), but they all execute on the same underlying thread pool in renderer._executor. If we ensure a unique tokenizer for each thread in that pool (either by holding many copies of the tokenizer or by forcing the pool to a single thread), won't we avoid all the problems?
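Rough sketch of the thread-local approach I have in mind (names are illustrative, not taken from the draft PR):

# Illustrative sketch of "one tokenizer per worker thread" via thread-local
# storage, so --renderer-num-workers > 1 could stay supported.
import copy
import threading
from concurrent.futures import ThreadPoolExecutor

class ThreadLocalTokenizer:
    """Hands each thread its own deep-copied tokenizer instance."""

    def __init__(self, tokenizer):
        self._base = tokenizer
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "tokenizer"):
            # Deep copy avoids sharing the Rust-backed fast tokenizer state.
            self._local.tokenizer = copy.deepcopy(self._base)
        return self._local.tokenizer

# Usage inside the renderer's executor (illustrative):
# executor = ThreadPoolExecutor(max_workers=num_workers)
# local_tok = ThreadLocalTokenizer(tokenizer)
# executor.submit(lambda: local_tok.get()(prompt))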

Collaborator


You're saying we want to create a deep copy of the tokenizer for each thread in the threadpool, rather than limiting the threadpool to one thread, right?

This is the most flexible solution. +1

I'm not sure I understand this. IIUC we use multithreading in many places (for the mm processor, pooling IO, applying chat templates, etc.), but they all execute on the same underlying thread pool in renderer._executor. If we ensure a unique tokenizer for each thread in that pool (either by holding many copies of the tokenizer or by forcing the pool to a single thread), won't we avoid all the problems?

We are refactoring the preprocessing code so that the thread pool and tokenizer pool can live in one place. However, we don't yet know how many tokenizer instances are in use across the codebase that have so far gone unnoticed.

Contributor Author


#41181 adds a thread-safety wrapper. I tried implementations with a mutex and with a thread-local copy; there wasn't a huge perf difference in my admittedly limited testing.
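For reference, the mutex variant looks roughly like this (illustrative sketch, not the actual #41181 code):

# Illustrative sketch: serialize access to a single shared tokenizer.
import threading

class LockedTokenizer:
    """Wraps a tokenizer so only one thread can call it at a time."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        with self._lock:
            return self._tokenizer(*args, **kwargs)

    def __getattr__(self, name):
        # Forwards attribute access; bound methods obtained this way bypass
        # the lock, so only __call__ is fully serialized in this sketch.
        return getattr(self._tokenizer, name)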

@yzong-rh yzong-rh changed the title [Bugfix] Fix RuntimeError: Already borrowed by enforcing one tokenizer per thread [Bugfix] Fix RuntimeError: Already borrowed by enforcing one tokenizer per thread and one thread in threadpool Apr 28, 2026
Signed-off-by: Yifan <yzong@redhat.com>
Signed-off-by: Yifan <yzong@redhat.com>
Comment thread vllm/renderers/base.py
mm_tokenizer = copy.deepcopy(tokenizer)
# Cannot use self.executor_tokenizer because the mm processor might
# mutate the tokenizer, corrupting the shared tokenizer.
self.mm_tokenizer = copy.deepcopy(tokenizer)
Contributor Author


Unfortunately, we can't use self.executor_tokenizer here because the mm processor may mutate its tokenizer. These processors were not designed with thread safety in mind.

For example, DeepseekVLV2Processor

if image_token_id is None:
    special_tokens = [image_token]
    special_tokens_dict = {"additional_special_tokens": special_tokens}
    self.tokenizer.add_special_tokens(special_tokens_dict)
self.image_token_id = self.tokenizer.vocab.get(image_token)

# add five special tokens for grounding-related tasks
# <|ref|>, <|/ref|>, <|det|>, <|/det|>, <|grounding|>
special_tokens = ["<|ref|>", "<|/ref|>", "<|det|>", "<|/det|>", "<|grounding|>"]
special_tokens_dict = {"additional_special_tokens": special_tokens}
self.tokenizer.add_special_tokens(special_tokens_dict)

# add special tokens for SFT data
special_tokens = ["<|User|>", "<|Assistant|>"]
special_tokens_dict = {"additional_special_tokens": special_tokens}
self.tokenizer.add_special_tokens(special_tokens_dict)


Labels

bug Something isn't working mistral Related to Mistral models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Huggingface Tokenizer "RuntimeError: Already borrowed".

2 participants