Async tokenization using thread pool #3206
Closed

njhill wants to merge 4 commits into vllm-project:main from
Conversation
Collaborator

Q: Is using a thread pool actually helping performance? I was curious mainly because of the GIL (and I suspect tokenization is CPU-bound).
It's true. Maybe we could just use a process pool instead.
Member (Author)
@rkooo567 @nickshawn Most of the HuggingFace tokenizers (in particular those for the most prominent models) use "fast" Rust-based implementations, which don't hold the GIL.
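This is why a thread pool can help despite the GIL: while one thread is inside a GIL-releasing encode call, the event loop keeps running. A minimal sketch of the pattern, assuming a plain `tokenize` function as a stand-in for a fast tokenizer's encode (the pool size of 4 is arbitrary, not the PR's default):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Rust-backed "fast" HF tokenizer encode call; the real
# thing releases the GIL while encoding, which makes threads worthwhile.
def tokenize(text: str) -> list[str]:
    return text.split()

# Arbitrary pool size for illustration; vLLM derives its default from
# the available CPU cores and the tensor parallel size.
_pool = ThreadPoolExecutor(max_workers=4)

async def tokenize_async(text: str) -> list[str]:
    # Run the encode call off the asyncio event-loop thread.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, tokenize, text)

async def main() -> None:
    results = await asyncio.gather(
        *(tokenize_async(t) for t in ["hello world", "async tokenization"])
    )
    print(results)

if __name__ == "__main__":
    asyncio.run(main())
```

With a pure-Python tokenizer the pool would mostly serialize on the GIL, which is the concern raised above; the fast tokenizers avoid that.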
joerunde pushed a commit to IBM/vllm that referenced this pull request on Mar 11, 2024
joerunde pushed a commit to IBM/vllm that referenced this pull request on Mar 11, 2024

Co-Authored-By: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
joerunde pushed a commit to IBM/vllm that referenced this pull request on Mar 12, 2024

Co-Authored-By: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
njhill added a commit to njhill/vllm that referenced this pull request on Mar 19, 2024
vllm-project#2879 added support for using ray to offload tokenization from the asyncio event loop. This PR extends that to support using a thread pool instead of ray, and makes that the default, with the default pool size determined based on the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread. This is because officially the HF tokenizers are not thread-safe. In practice I think they are unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code.

This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
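The commit message says the default pool size is derived from the available CPU cores and the tensor parallel size; a heuristic of that shape could look like the sketch below. Both the function name and the exact formula are illustrative assumptions, not the PR's actual code:

```python
import os

def default_tokenizer_pool_size(tensor_parallel_size: int) -> int:
    # Hypothetical heuristic: divide the available cores among the
    # tensor-parallel workers, keeping at least one tokenizer thread.
    # The real formula in the PR may differ.
    cpus = os.cpu_count() or 1
    return max(1, cpus // tensor_parallel_size)
```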
@Yard1's open PR #2879 uses ray to offload tokenization from the asyncio event loop.
This PR extends that to support using a thread pool instead of ray. Here is the diff showing just the newly added commits (note that I also rebased onto the latest main branch).
The main thing to note is that separate tokenizer instances are used per thread. This is because officially the HF tokenizers are not thread-safe. In practice I think they are unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example #3144).
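One common way to get a separate tokenizer instance per worker thread is `threading.local`. The sketch below uses a trivial stand-in class rather than a real HF tokenizer and illustrates the pattern only, not the PR's implementation:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class _SimpleTokenizer:
    """Stand-in for an HF tokenizer; each thread gets its own instance."""
    def encode(self, text: str) -> list[str]:
        return text.split()

_local = threading.local()

def _get_tokenizer() -> _SimpleTokenizer:
    # Lazily create one tokenizer per worker thread, since HF tokenizers
    # are not officially thread-safe (notably around padding/truncation).
    tok = getattr(_local, "tokenizer", None)
    if tok is None:
        tok = _local.tokenizer = _SimpleTokenizer()
    return tok

def encode(text: str) -> list[str]:
    return _get_tokenizer().encode(text)

pool = ThreadPoolExecutor(max_workers=2)
results = list(pool.map(encode, ["a b", "c d e"]))
print(results)  # [['a', 'b'], ['c', 'd', 'e']]
```

Each worker thread lazily builds its own tokenizer on first use, so no instance is ever shared across threads.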