Fix fast tokenizer swallows prefix space when there are too many white spaces #992
Conversation
Conflicts: lmdeploy/turbomind/hf_repo/modeling_lmdeploy.py
Could you provide an example to reproduce the issue?
Yes, the case is already covered in unit tests now. This is an example for the main branch:

```python
def test_tokenizer(model_path, input):
    from lmdeploy.tokenizer import HuggingFaceTokenizer
    tokenizer = HuggingFaceTokenizer(model_path)
    encoded = tokenizer.encode(input, False)
    output = ''
    offset = 0
    for i in range(1, len(encoded) + 1):
        decoded = tokenizer.decode(encoded[:i], offset)
        if decoded.endswith('�'):
            continue
        output += decoded
        offset = i
    assert input == output, 'input string should equal the output after enc-dec'

test_tokenizer('01-ai/Yi-34B-Chat', 'a b')
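(For readers unfamiliar with the check: `'�'` is U+FFFD, the Unicode replacement character. `decode` emits it while the bytes of a multi-byte character are still incomplete, so the loop holds back output until the character can be fully decoded.)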
Did not influence the performance of
```python
output_tokens = prev_tokens + new_tokens
prev_tokens += new_tokens
# ...
prefix_text = self._convert_tokens_to_string_with_added_encoders(
```
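For context, these hunk lines belong to the prefix-diff detokenization pattern: re-decode a window of already-emitted tokens together with the new ones, and emit only the difference. A minimal sketch under assumed names (`prefix_offset` and `read_offset` follow the vLLM convention; this is not lmdeploy's exact code):

```python
def decode_delta(tokenizer, output_tokens, prefix_offset, read_offset):
    """Decode only the text added since the last call.

    Re-decoding a window of earlier tokens together with the new ones,
    then slicing off the already-emitted prefix, preserves white spaces
    that per-token decoding would swallow.
    """
    prefix_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:read_offset])
    new_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:])
    return new_text[len(prefix_text):]
```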
I looked at vLLM's implementation. It has a branch:

```python
if tokenizer.is_fast or not tokenizer.get_added_vocab():
    prefix_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:read_offset])
    new_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:])
```

But we don't have that branch here. What is the reason?
It has been moved into the `_convert_tokens_to_string_with_added_encoders` function.
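For concreteness, a sketch of how the branch can live inside the helper, modeled on vLLM's function of the same name (illustrative; not necessarily lmdeploy's exact code):

```python
def _convert_tokens_to_string_with_added_encoders(tokenizer, output_tokens,
                                                  skip_special_tokens=False):
    # Fast tokenizers, and tokenizers without added vocabulary, can use
    # the built-in conversion directly; this is the branch that vLLM
    # keeps at the call site, folded into the helper here.
    if tokenizer.is_fast or not tokenizer.get_added_vocab():
        return tokenizer.convert_tokens_to_string(output_tokens)

    # Otherwise added tokens must be stitched in by hand, because
    # convert_tokens_to_string does not handle them correctly.
    sub_texts = []
    current_sub_text = []
    for token in output_tokens:
        if skip_special_tokens and token in tokenizer.all_special_tokens:
            continue
        if token in tokenizer.get_added_vocab():
            if current_sub_text:
                sub_texts.append(
                    tokenizer.convert_tokens_to_string(current_sub_text))
                current_sub_text = []
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(tokenizer.convert_tokens_to_string(current_sub_text))
    return ' '.join(sub_texts)
```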
The previous detokenization cannot handle many consecutive empty tokens for `LlamaTokenizer`, e.g. with the `upstage/SOLAR-0-70b-16bit` model. In that case the incrementally decoded output may lack a white space. This PR does not break backward compatibility, since the fix is implemented in a new function named `detokenize_incrementally`. However, full tests are still required.
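As a review aid, this is how an incremental detokenizer of this shape is typically driven. The loop below is a hypothetical usage sketch that only illustrates the contract (re-decode a small window, emit only completed text, carry `prefix_offset`/`read_offset` across calls); the actual `detokenize_incrementally` signature may differ:

```python
from transformers import AutoTokenizer

def stream_decode(model_path, token_ids):
    """Hypothetical driver illustrating incremental detokenization."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    prefix_offset, read_offset = 0, 0
    text = ''
    for i in range(1, len(tokens) + 1):
        # Decode the prefix window and the window extended by the new
        # token; the string difference is the newly completed text.
        prefix_text = tokenizer.convert_tokens_to_string(
            tokens[prefix_offset:read_offset])
        new_text = tokenizer.convert_tokens_to_string(tokens[prefix_offset:i])
        # Hold back output while a multi-byte character is incomplete.
        if len(new_text) > len(prefix_text) and not new_text.endswith('�'):
            text += new_text[len(prefix_text):]
            prefix_offset, read_offset = read_offset, i
    return text
```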