
[Tokenizer] Add an option to specify tokenizer #284

Merged
WoosukKwon merged 13 commits into main from custom-tokenizer
Jun 28, 2023

Conversation

@WoosukKwon
Collaborator

@WoosukKwon WoosukKwon commented Jun 28, 2023

Fixes #111 #246 #259 #270 #281

This PR adds tokenizer to the input/CLI arguments. If it is None, vLLM uses the model name/path as the tokenizer name/path. In addition, starting with this PR, vLLM no longer uses hf-internal-testing/llama-tokenizer as the default tokenizer for LLaMA models.
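As a rough sketch of the fallback described here (the argument names follow the PR description; this is not vLLM's actual engine-argument parser code):

```python
import argparse

# Hypothetical sketch: argument names follow the PR description, not
# necessarily vLLM's actual parser.
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="facebook/opt-125m")
parser.add_argument("--tokenizer", type=str, default=None)

args = parser.parse_args(["--model", "huggyllama/llama-7b"])
# If --tokenizer is not given, fall back to the model name/path.
tokenizer_name = args.tokenizer if args.tokenizer is not None else args.model
print(tokenizer_name)  # huggyllama/llama-7b
```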

@WoosukKwon WoosukKwon requested a review from zhuohan123 June 28, 2023 08:31
@sleepcoo
Contributor

This PR is very useful; in my local tests I always had to hard-code the tokenizer path.

Member

@zhuohan123 zhuohan123 left a comment


LGTM! Thanks for the great work!

Comment on lines +20 to +25
if "open_llama" in tokenizer_name.lower() and kwargs.get("use_fast", True):
    # OpenLLaMA models do not support the fast tokenizer.
    kwargs["use_fast"] = False
    logger.info(
        "OpenLLaMA models do not support the fast tokenizer. "
        "Using the slow tokenizer instead.")
elif config.model_type == "llama" and kwargs.get("use_fast", True):
    # The LLaMA fast tokenizer causes protobuf errors in some environments.
    # However, we found that the below LLaMA fast tokenizer works well in
    # most environments.
    model_name = "hf-internal-testing/llama-tokenizer"
    logger.info(
        f"Using the LLaMA fast tokenizer in '{model_name}' to avoid "
        "potential protobuf errors.")
elif config.model_type in _MODEL_TYPES_WITH_SLOW_TOKENIZER:
    if kwargs.get("use_fast", False):
        raise ValueError(
            f"Cannot use the fast tokenizer for {config.model_type} due to "
            "bugs in the fast tokenizer.")
    logger.info(
        f"Using the slow tokenizer for {config.model_type} due to bugs in "
        "the fast tokenizer. This could potentially lead to performance "
        "degradation.")
    kwargs["use_fast"] = False
return AutoTokenizer.from_pretrained(model_name, *args, **kwargs)

The PR replaces the hard-coded default with a warning:

logger.info(
    "For some LLaMA-based models, initializing the fast tokenizer may "
    "take a long time. To eliminate the initialization time, consider "
    f"using '{_FAST_LLAMA_TOKENIZER}' instead of the original "
    "tokenizer.")
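Distilled from the branch logic above, a minimal self-contained sketch of how the tokenizer name and use_fast flag are selected (the function name and the contents of the slow-tokenizer list are assumptions here, and the actual Hugging Face loading call is stubbed out):

```python
# Illustrative sketch only: the slow-tokenizer model-type list is a
# placeholder, and no tokenizer is actually loaded.
_FAST_LLAMA_TOKENIZER = "hf-internal-testing/llama-tokenizer"
_MODEL_TYPES_WITH_SLOW_TOKENIZER = ("chatglm",)  # placeholder contents

def select_tokenizer(model_type: str, tokenizer_name: str, **kwargs):
    """Return (tokenizer_name, use_fast) following the branches above."""
    use_fast = kwargs.get("use_fast", True)
    if model_type == "llama" and use_fast:
        # Swap in the known-good fast tokenizer to avoid protobuf errors.
        return _FAST_LLAMA_TOKENIZER, True
    if model_type in _MODEL_TYPES_WITH_SLOW_TOKENIZER:
        if kwargs.get("use_fast", False):
            raise ValueError(
                f"Cannot use the fast tokenizer for {model_type}.")
        # Fall back to the slow tokenizer despite the performance cost.
        return tokenizer_name, False
    return tokenizer_name, use_fast
```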
Member


After this PR, do we need to manually specify the LLaMA fast tokenizer for benchmarking?

Collaborator Author


It depends. Actually, the LLaMA fast tokenizers in lmsys/vicuna-7b-v1.3 or huggyllama/llama-7b work in my Docker environment, so hf-internal-testing/llama-tokenizer is not needed when I use vLLM there.

@WoosukKwon WoosukKwon merged commit 4338cc4 into main Jun 28, 2023
@WoosukKwon WoosukKwon deleted the custom-tokenizer branch June 28, 2023 16:47
@929359291

Wow, this is cool. Bro, you are so cool!

@sunyuhan19981208

THANKS VERY MUCH!

michaelfeil pushed a commit to michaelfeil/vllm that referenced this pull request Jul 1, 2023
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
SUMMARY:
* only run 4 x A10 tests for Python 3.10.12

NOTE: AWS looks to be having availability issues with these instances.
I'm day-to-day on this repo being migrated to GCP, so in the meantime
let's reduce demand.

TEST PLAN:
n/a

Co-authored-by: andy-neuma <andy@neuralmagic.com>
billishyahao pushed a commit to billishyahao/vllm that referenced this pull request Dec 31, 2024
* fix CUDA compilation

* check out tuned GEMM from develop
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Dec 11, 2025
Sync to upstream's
[v0.11.0](https://github.com/vllm-project/vllm/releases/tag/v0.11.0)
release + a cherry pick of
vllm-project#24768

This PR targets CUDA but may also be sufficient for ROCm.

Dockerfile updates:
- general updates to match upstream's Dockerfile
- nvcc, nvrtc and cuobjdump were added for deepgemm JIT requirements:
neuralmagic/nm-vllm-ent@2a545c8
- missing paths were added for triton JIT:
neuralmagic/nm-vllm-ent@b3027fc

Tests:
Branch in nm-cicd:
https://github.com/neuralmagic/nm-cicd/tree/sync-v0.11-cuda
accept-sync:
https://github.com/neuralmagic/nm-cicd/actions/runs/18270550524 --
please ignore unit tests, they need to be updated to v1.
Image tested: quay.io/vllm/automation-vllm:cuda-18270550524
Image validation:
https://github.com/neuralmagic/nm-cicd/actions/runs/18271507914
Whisper runs:
https://github.com/neuralmagic/nm-cicd/actions/runs/18281815955/job/52046560584
https://github.com/neuralmagic/nm-cicd/actions/runs/18281511979
mickg10 pushed a commit to mickg10/vllm that referenced this pull request Feb 11, 2026
(cherry picked from commit 0c8ef2a)

Signed-off-by: Salar <skhorasgani@tenstorrent.com>


5 participants