[Bugfix] Fix RuntimeError: Already borrowed by adding thread-safe Hugging Face fast-tokenizer wrappers #41181

yzong-rh wants to merge 1 commit into vllm-project:main from

Conversation
Updated other tokenizer paths that use the HF fast tokenizer.
Code Review
This pull request implements thread-safe wrappers for Hugging Face tokenizers, providing both lock-based and thread-local copy implementations controlled by an environment variable. It also removes previous deepcopy workarounds and retry logic for tokenizer access. Feedback identifies a regression in default thread safety for multimodal models, a missing batch_decode override in the lock-based wrapper, and the need to remove commented-out code. Additionally, it is recommended to use the VLLM_ prefix for the new environment variable and replace print statements with the project's logging system.
```python
if not isinstance(tokenizer, PreTrainedTokenizerFast):
    return tokenizer

thread_safe_tokenizer = copy.copy(tokenizer)
lock = threading.RLock()
```
We have confirmed that having each thread use its own deepcopy(tokenizer) can resolve the "RuntimeError: Already borrowed", so there is no need to add a lock to protect it.
get_thread_safe_hf_tokenizer_w_lock doesn't deepcopy; this one only uses the lock. I left it in for performance comparison and will need to remove it before readying.
The deepcopy version is in get_thread_safe_hf_tokenizer_w_copy, which doesn't use the lock.
Both solve the already-borrowed issue.
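For orientation, here is a hedged sketch of what a lock-based wrapper along these lines could look like. It is not the PR's actual code, and the set of wrapped methods (including batch_decode, which the review above flags) is an assumption:

```python
import copy
import threading
from functools import wraps

from transformers import PreTrainedTokenizerFast


def get_thread_safe_hf_tokenizer_w_lock(tokenizer):
    """Sketch: serialize access to the Rust backend with a single RLock."""
    if not isinstance(tokenizer, PreTrainedTokenizerFast):
        return tokenizer

    thread_safe_tokenizer = copy.copy(tokenizer)
    lock = threading.RLock()

    def locked(method):
        @wraps(method)
        def inner(*args, **kwargs):
            with lock:
                return method(*args, **kwargs)
        return inner

    # Wrap the entry points that reach the Rust backend. Exactly which methods
    # need wrapping is an assumption here; batch_decode is included because
    # the review notes it was missing from the lock-based wrapper.
    for name in ("encode", "decode", "batch_decode"):
        setattr(thread_safe_tokenizer, name,
                locked(getattr(thread_safe_tokenizer, name)))

    return thread_safe_tokenizer
```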
I didn't see a huge perf difference between them, even with --api-server-count=4. I need more profiling, but if perf is similar I actually lean more towards the lock implementation, as it handles mutation better (less unexpected behavior if the tokenizer is mutated). Wdyt?
I think we should avoid using a lock in the case where there is only one thread. Using deepcopy would be cleaner in terms of code, IMO.
In the short term, (maybe) due to the Python GIL, setting --renderer_num_workers > 1 and using a thread pool provides almost no performance improvement, so setting it to 1 is reasonable.
But in the long term, we still want to improve performance through multithreading — for example, by upgrading to Python 3.14, or by finding some way to bypass the GIL. So personally, I still want to keep the --renderer_num_workers parameter.
Because we don't know exactly what is happening during the preprocessing stage, it's very difficult to optimize further. So, next month I plan to first add better observability for the entrypoints. Perhaps this will help us better understand what is actually going on.
PTAL #39979
Summarizing what we know so far:
- Offload blocking tokenizer ops to a shared thread pool to unblock the event loop: [Bugfix] Offload blocking tokenizer ops to shared thread pool to unblock event loop #34789, [Frontend] Offload blocking preprocessing & postprocessing ops to thread pool for pooling entrypoints. #39763
- (Maybe) due to the Python GIL, setting --renderer_num_workers > 1 and using a thread pool provides almost no performance improvement: #34789 (comment)
- https://github.com/noooop/snippet/blob/main/benchmarks/thread_pool/: the thread pool can accelerate torch.mm and tokenizer.encode, but it cannot accelerate torch.mm.
- The Hugging Face fast tokenizer is not thread-safe, which may cause RuntimeError: Already borrowed; deepcopy(tokenizer) can resolve this: [Bugfix] Fix RuntimeError: Already borrowed that degrades VLM serving throughput under concurrent load. #36557
- Using deepcopy(tokenizer) on every call in a multithreaded environment will certainly introduce some overhead, so we need a tokenizer pool: https://github.com/vllm-project/vllm/pull/28458/changes#r3159044061
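As a rough illustration of the "tokenizer pool" idea in the last bullet (a sketch under assumed names and pool size, not the implementation from the linked PR):

```python
import copy
from contextlib import contextmanager
from queue import Queue


class TokenizerPool:
    """Reuse a fixed set of deepcopied tokenizers instead of copying per call.

    Queue.get() blocks when all copies are checked out, so at most `size`
    callers touch a given backend tokenizer at any moment.
    """

    def __init__(self, tokenizer, size: int = 4):
        self._pool: Queue = Queue()
        for _ in range(size):
            self._pool.put(copy.deepcopy(tokenizer))

    @contextmanager
    def borrow(self):
        tok = self._pool.get()
        try:
            yield tok
        finally:
            self._pool.put(tok)


# Usage: pay the deepcopy cost once at startup, then borrow per request.
# pool = TokenizerPool(tokenizer, size=8)
# with pool.borrow() as tok:
#     ids = tok.encode("hello world")
```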
from concurrent.futures import ProcessPoolExecutor
tokenizer_encode n_workers: 1, e2e: 12.22839603600005
tokenizer_encode n_workers: 2, e2e: 6.675321902999713
tokenizer_encode n_workers: 4, e2e: 3.770936963999702
tokenizer_encode n_workers: 8, e2e: 2.363701287999902
tokenizer_encode n_workers: 16, e2e: 2.231783658000495
from concurrent.futures import ThreadPoolExecutor
tokenizer_encode n_workers: 1, e2e: 11.007459864000339
tokenizer_encode n_workers: 2, e2e: 5.999560164000286
tokenizer_encode n_workers: 4, e2e: 3.4470128889997795
tokenizer_encode n_workers: 8, e2e: 2.0952387249999447
tokenizer_encode n_workers: 16, e2e: 1.9648324209993007
ThreadPoolExecutor + deepcopy
tokenizer_encode n_workers: 1, e2e: 29.1953536609999
tokenizer_encode n_workers: 2, e2e: 18.48894192500029
tokenizer_encode n_workers: 4, e2e: 18.999746747000245
tokenizer_encode n_workers: 8, e2e: 18.900613595999857
tokenizer_encode n_workers: 16, e2e: 19.041412347000005
I believe the tokenizer benefits from ThreadPoolExecutor, but we cannot afford to use deepcopy on every execution, as it would introduce significant overhead. So we need to add a tokenizer pool to avoid repeated deepcopy.
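For reference, the numbers above come from the linked snippet repo; a rough stand-in for that kind of microbenchmark (the tokenizer, corpus, and sizes here are assumptions) could look like:

```python
import time
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer

# Any HF fast tokenizer works for this kind of measurement; gpt2 is only an
# assumption for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
texts = ["some moderately long prompt text " * 64] * 4096


def tokenizer_encode(text: str) -> int:
    # With default settings each call usually avoids mutating the Rust
    # backend, so a shared tokenizer typically survives this benchmark.
    return len(tokenizer.encode(text))


for n_workers in (1, 2, 4, 8, 16):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(tokenizer_encode, texts))
    print(f"tokenizer_encode n_workers: {n_workers}, "
          f"e2e: {time.perf_counter() - start}")
```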
Force-pushed from 38224d7 to 347112f.
```python
backend_tokenizer = thread_safe_tokenizer._tokenizer

# Concurrent dict insertion is safe here thanks to the GIL.
thread_local = {threading.get_ident(): copy.deepcopy(backend_tokenizer)}
```
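To make the quoted hunk easier to follow in isolation, here is a hedged sketch of how a per-thread-copy lookup like this could be wrapped up (the helper name and structure are illustrative, not the PR's code):

```python
import copy
import threading


def make_per_thread_backend(backend_tokenizer):
    """Return a getter that gives each thread its own deepcopy of the Rust
    backend tokenizer, created lazily the first time that thread asks."""
    # Keyed by thread ident; plain dict insertion is atomic under the GIL.
    per_thread = {threading.get_ident(): copy.deepcopy(backend_tokenizer)}

    def get_backend():
        ident = threading.get_ident()
        backend = per_thread.get(ident)
        if backend is None:
            # First call from this thread: make a private copy so it never
            # shares a mutable borrow with another thread.
            backend = copy.deepcopy(backend_tokenizer)
            per_thread[ident] = backend
        return backend

    return get_backend
```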
We can switch to threading.local() easily when we no longer rely on the GIL.
threading.local() uses a small lock internally (according to AI) and does not scale as well in microbenchmarks.
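For comparison, a threading.local()-based version of the same idea (sketch only, not code from the PR) would be roughly:

```python
import copy
import threading


class LocalBackend:
    """Each thread lazily gets its own deepcopy via threading.local(), at the
    cost of threading.local's internal bookkeeping on every lookup."""

    def __init__(self, backend_tokenizer):
        self._template = backend_tokenizer
        self._local = threading.local()

    def get(self):
        backend = getattr(self._local, "backend", None)
        if backend is None:
            backend = copy.deepcopy(self._template)
            self._local.backend = backend
        return backend
```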
Can we have a little less hackiness?
I think using Queue() might be a bit simpler, just like https://github.com/vllm-project/vllm/pull/28458/changes#r3159044061
Should have read your previous comment more carefully. Queue() is quite clean and performant.
My only concern is that FastIncrementalDetokenizer (which runs on the main thread) uses the Rust backend directly. I'll try either patching _tokenizer's methods or making FastIncrementalDetokenizer go through the Python tokenizer.
vllm/v1/engine/detokenizer.py, lines 168 to 177 at c42981d
You can ignore FastIncrementalDetokenizer for now. As I understand it, it runs in the engine core and will only run in a single thread for now.
Can you rerun some of the benchmarks with this updated code?
All the benchmarks are updated as of April 29; they were all rerun when I readied the PR. Main branch: 0ab67c0. To be fair, though, there is quite a lot of run-to-run variance in the E2E runs, so I mainly looked for perf regressions. The micro-benchmark is much more repeatable.
Purpose

Thread-safe Hugging Face fast-tokenizer wrapper for the RuntimeError: Already borrowed concurrency issue reported in #40949. Fixes concurrency issues with the bad_words sampling param. Removes the need to use deepcopy for the multimodal processor.

Test Plan
benchmark.py
stress_send.py
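The contents of stress_send.py aren't included in this thread; a minimal stand-in that exercises the same concurrent-request path (the endpoint, model name, and request counts below are assumptions) might look like:

```python
import asyncio

import httpx

URL = "http://localhost:8000/v1/completions"  # assumed local vLLM server
PAYLOAD = {
    "model": "Qwen/Qwen3-4B-Instruct-2507-FP8",  # assumed; use whatever is served
    "prompt": "Tell me a short story about concurrency.",
    "max_tokens": 64,
}


async def main(n_requests: int = 256, concurrency: int = 32) -> None:
    sem = asyncio.Semaphore(concurrency)

    async with httpx.AsyncClient(timeout=120) as client:

        async def one() -> None:
            async with sem:
                resp = await client.post(URL, json=PAYLOAD)
                resp.raise_for_status()

        # Many overlapping requests is what previously surfaced
        # "RuntimeError: Already borrowed" in the tokenizer path.
        await asyncio.gather(*(one() for _ in range(n_requests)))


if __name__ == "__main__":
    asyncio.run(main())
```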
Test Result
Unit tests pass
Benchmark:
Qwen3-4B-Instruct-2507-FP8
Before:
After:
DeepSeek-V4-Flash
Before:
After:
Qwen-VL-Chat
Before:
After:
DeepSeek-OCR
Before:
After:
AI Assistance
Made with Cursor
cc @sfeng33 @bbrowning