[Bugfix] Add num_special_tokens_to_add to MistralTokenizer, fixes #22013#22121
[Bugfix] Add num_special_tokens_to_add to MistralTokenizer, fixes #22013#22121ShUl0w wants to merge 1 commit intovllm-project:mainfrom
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Code Review
This pull request fixes an AttributeError in the MistralTokenizer by adding the num_special_tokens_to_add method. While the fix unblocks the benchmark, the implementation is conceptually mismatched with the standard tokenizer API and has inconsistencies that could lead to future bugs. I've suggested a refactoring to improve clarity, correctness, and safety.
There was a problem hiding this comment.
This implementation of num_special_tokens_to_add has a few issues that could lead to incorrect behavior and confusion for future developers:
-
Conceptual Mismatch: The method name
num_special_tokens_to_addis part of the Hugging Face Transformers tokenizer API and typically reflects the number of special tokens added by theencode()method. However, this implementation seems to calculate the tokens forapply_chat_template()(i.e.,[INST],[/INST], and aBOStoken). This is misleading, asself.encode()only adds aBOStoken by default. This can lead to incorrect token calculations and truncation if used outside of the specific benchmark context. -
Unsupported
pairLogic: The logic forpair=Trueis not supported by the tokenizer's__call__method, which currently ignores thetext_pairargument. This inconsistency means any code callingnum_special_tokens_to_add(pair=True)will get a number that doesn't match the tokenizer's actual behavior, leading to bugs. -
Redundant
bos_token_idCheck: Thebos_token_idproperty is typed to return anintand likely neverNone. Thehasattrcheck is also redundant. This makes the logic unnecessarily complex.
To improve correctness and clarity, I suggest refactoring this method to clearly state its purpose and remove the unsupported pair logic. The implementation should be simplified to reflect the tokens added for a single-turn chat prompt.
Here is a suggested implementation with improved comments:
| def num_special_tokens_to_add(self, pair: bool = False) -> int: | |
| # accomodates for [INST] and [/INST], which are always added | |
| num_tokens = 2 | |
| # MistralTokenizer does not appear to add an eos token | |
| if hasattr(self, "bos_token_id") and self.bos_token_id is not None: | |
| num_tokens += 1 | |
| # SentencePiece adds <0x0A><0x0A> in this case, tekken does not | |
| if pair and self.is_spm: | |
| num_tokens += 2 | |
| return num_tokens | |
| def num_special_tokens_to_add(self, pair: bool = False) -> int: | |
| """ | |
| Returns the number of special tokens added by the chat template for a | |
| single-turn conversation. | |
| Note: This method's behavior is specific to chat completion and does not | |
| reflect the special tokens added by the `encode()` method. | |
| """ | |
| if pair: | |
| # The `__call__` method of this tokenizer does not support encoding | |
| # pairs of sequences. | |
| raise NotImplementedError( | |
| "Encoding token pairs is not supported by this tokenizer." | |
| ) | |
| # A single-turn chat prompt is templated with <s>, [INST], and [/INST]. | |
| # self.tokenizer.bos_id is expected to be always present. | |
| num_tokens = 3 # BOS + [INST] + [/INST] | |
| return num_tokens |
There was a problem hiding this comment.
I understand the points outlined by the automatic code review, however, they deviate from my prior inspection into the respective implementations.
1 & 3: Please find my reasoning for the implementation here. I don't wan't to rule out a misunderstanding of the tokenizer on my end.
2: I tried to stick as close to the original implementation of the function in transformers as possible, without adding any further possibly confusing functions (i.e. build_input_with_special_tokens).
As this is my first PR to the project, I'd be happy to receive guidance on how to proceed, and what changes I should add.
|
Hey @ShUl0w, Thanks for the PR! I don't think the bool vllm/benchmarks/benchmark_dataset.py Line 314 in 6d98843 I don't think we have to / should stick to the transformers design as this is not a tokenizer in transformers format. |
Signed-off-by: ShUl0w <37832993+ShUl0w@users.noreply.github.com>
9039be4 to
be39b4c
Compare
|
@patrickvonplaten thanks a lot for your quick feedback! I've removed the |
|
Hi @patrickvonplaten, is there anything else I can/should do in this PR? Thanks for a quick info! |
|
Hi! Is there any update on this? I'm (unsuccessfully) trying to benchmark Thanks in advance!! (cc @ShUl0w @patrickvonplaten) |
|
Superseded by #30009 |
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.Purpose
Test Plan
Manually testing the serve benchmark without the changes and with the changes in place.
uv run benchmarks/benchmark_serving.py --backend openai-chat --model /models/ministral-8b-instruct-2410 --served-model-name ministral-8b-instruct-2410 --endpoint /v1/chat/completions --dataset-name random --num-prompts 1 --tokenizer_mode mistral --base-url "http://0.0.0.0:8000" --random-input-len 64 --random-output-len 64Test Result
Manually testing without the changes produces an error:
Manually testing with the changes produces the desired result, i.e. a full run of the benchmark:
(Optional) Documentation Update