[Misc] Refactor tokenizer interface#29693
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
Documentation preview: https://vllm--29693.org.readthedocs.build/en/29693/ |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
/gemini review |
TokenizerBase to protocol, consolidate tokenizer testsSigned-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed refactoring that modernizes the tokenizer abstraction by introducing the TokenizerLike protocol. This change improves modularity, type safety, and flexibility across the codebase. The consolidation of tokenizer tests into a dedicated tests/tokenizers directory is also a great organizational improvement.
I've identified a couple of critical bug fixes related to handling cases where the tokenizer is not initialized (skip_tokenizer_init=True), which significantly improves the robustness of the OpenAI-compatible endpoints. These changes prevent potential crashes and provide clearer error messages to the user.
Overall, this is an excellent contribution that enhances the maintainability and reliability of vLLM's tokenization system. The changes are thorough and consistent.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
Failing tests are known failures on main |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
AnyTokenizerandTokenizerBaseby makingTokenizerBaseaProtocol, and rename both of them toTokenizerLike(with back-compatibility)TokenizerBase:sep_token,pad_token,encode_oneMistralTokenizer,TokenizerLike,TokenizerRegistryintovllm.tokenizersAnyTokenizerwithTokenizerLiketests/tokenizationtotests/tokenizersand update Buildkite pipeline accordinglyTest Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.