
Async tokenization using thread pool #3206

Closed
njhill wants to merge 4 commits into vllm-project:main from njhill:async_tokenization

Conversation

njhill (Member) commented Mar 5, 2024

@Yard1's open PR #2879 uses ray to offload tokenization from the asyncio event loop.

This PR extends that to support using a thread pool instead of ray. Here is the diff showing just the newly added commits (note that I also rebased onto the latest main branch).

The main thing to note is that separate tokenizer instances are used per thread, because the HF tokenizers are not officially thread-safe. In practice I think they are, unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example #3144).
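A minimal sketch of the per-thread-tokenizer pattern described above, using `threading.local` so each worker thread lazily builds its own instance. A stand-in tokenizer (`text.split()`) is used here in place of a real HF `AutoTokenizer`, so this is illustrative rather than the PR's actual code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def make_tokenizer():
    # Stand-in factory; a real version might call
    # AutoTokenizer.from_pretrained(model_name) here (assumption).
    return lambda text: text.split()

_local = threading.local()

def _get_tokenizer():
    # Lazily build one tokenizer per worker thread, since HF
    # tokenizers are not officially thread-safe.
    if not hasattr(_local, "tokenizer"):
        _local.tokenizer = make_tokenizer()
    return _local.tokenizer

def encode(text):
    # Each call uses the tokenizer owned by the current thread.
    return _get_tokenizer()(text)

pool = ThreadPoolExecutor(max_workers=4)
futures = [pool.submit(encode, "hello async world") for _ in range(8)]
results = [f.result() for f in futures]
```

With this pattern, no tokenizer object is ever shared across threads, which sidesteps the thread-safety question (including the padding/truncation case) entirely.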

njhill mentioned this pull request Mar 5, 2024
njhill force-pushed the async_tokenization branch from 3d314f0 to 85e01c3 on March 5, 2024 at 16:24
njhill (Member, Author) commented Mar 5, 2024

@Yard1 separate PR to hopefully address the failing Ray distributed CI tests: #3207

njhill force-pushed the async_tokenization branch from a5851e7 to 8b55b05 on March 6, 2024 at 01:28
rkooo567 (Collaborator) commented Mar 6, 2024

Q: does using a thread pool actually help performance? I'm curious mainly because of the GIL (and I suspect tokenization is CPU-bound).

nickshawn commented

> Q: does using a thread pool actually help performance? I'm curious mainly because of the GIL (and I suspect tokenization is CPU-bound).

True. Maybe we could just use a process pool (ProcessPoolExecutor) instead?

njhill (Member, Author) commented Mar 6, 2024

@rkooo567 @nickshawn most of the Hugging Face tokenizers (in particular those for the most prominent models) use "fast" Rust-based implementations, which release the GIL rather than hold it.
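Because the Rust core of a "fast" tokenizer releases the GIL during encoding, worker threads get real parallelism, and the asyncio event loop only needs to offload the blocking call. A rough sketch of that offloading via `run_in_executor` (with `text.split()` standing in for the actual tokenizer, so the parallelism claim is about the real Rust implementation, not this stand-in):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def encode(text):
    # Stand-in for tokenizer.encode(text); the real "fast" HF
    # tokenizer would release the GIL while encoding.
    return text.split()

async def tokenize_async(pool, text):
    loop = asyncio.get_running_loop()
    # Offload blocking tokenization so the event loop keeps
    # serving other requests while encoding runs in a worker thread.
    return await loop.run_in_executor(pool, encode, text)

async def main():
    pool = ThreadPoolExecutor(max_workers=2)
    return await asyncio.gather(
        tokenize_async(pool, "first request"),
        tokenize_async(pool, "second request"),
    )

results = asyncio.run(main())
```

This is also why a thread pool is attractive over a process pool here: it avoids pickling inputs and outputs across process boundaries while still keeping the event loop responsive.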

njhill force-pushed the async_tokenization branch from 8b55b05 to 71d73ed on March 8, 2024 at 01:15
joerunde pushed a commit to IBM/vllm that referenced this pull request Mar 11, 2024
Co-Authored-By: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
joerunde pushed a commit to IBM/vllm that referenced this pull request Mar 12, 2024
Co-Authored-By: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
njhill (Member, Author) commented Mar 16, 2024

@Yard1 I've opened #3449 to replace this now that your PR is merged, ptal!

njhill closed this Mar 16, 2024
njhill added a commit to njhill/vllm that referenced this pull request Mar 19, 2024
vllm-project#2879 added support for using ray to offload tokenization from the asyncio event loop.

This PR extends that to support using a thread pool instead of ray, and makes that the default, with the default pool size determined based on the number of available CPU cores and the tensor parallel size.

The main thing to note is that separate tokenizer instances are used per thread, because the HF tokenizers are not officially thread-safe. In practice I think they are, unless you're making use of padding/truncation, which we aren't currently but may want to soon (see for example vllm-project#3144).

Also includes some type hint additions to related parts of the code.

This replaces the original PR vllm-project#3206 from before vllm-project#2879 was reworked and merged.
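The commit message says the default pool size is derived from the number of available CPU cores and the tensor parallel size, but doesn't spell out the formula. A plausible heuristic along those lines (the function name, the cap, and the exact arithmetic are all hypothetical, not the PR's actual rule) might look like:

```python
import os

def default_tokenizer_pool_size(tensor_parallel_size, cap=16):
    # Hypothetical heuristic: reserve one core per tensor-parallel
    # worker and use what remains for tokenization, clamped to
    # the range [1, cap] so we never return zero or oversubscribe.
    cpus = os.cpu_count() or 1
    return max(1, min(cap, cpus - tensor_parallel_size))
```

The cap matters because each pool thread carries its own tokenizer instance, so an unbounded pool would multiply tokenizer memory on many-core machines.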
