Fix fast tokenizer swallows prefix space when there are too many white spaces #992
Conversation
Conflicts: lmdeploy/turbomind/hf_repo/modeling_lmdeploy.py
Could you provide an example to reproduce the issue?
Yes, the case is already covered in unit tests now. This is an example for the main branch:

```python
def test_tokenizer(model_path, input):
    from lmdeploy.tokenizer import HuggingFaceTokenizer
    tokenizer = HuggingFaceTokenizer(model_path)
    encoded = tokenizer.encode(input, False)
    output = ''
    offset = 0
    for i in range(1, len(encoded) + 1):
        decoded = tokenizer.decode(encoded[:i], offset)
        if decoded.endswith('�'):
            continue
        output += decoded
        offset = i
    assert input == output, 'input string should equal the output after enc-dec'

test_tokenizer('01-ai/Yi-34B-Chat', 'a b')
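(For readers unfamiliar with the check: `'�'` is U+FFFD, the Unicode replacement character. `decode` emits it while the bytes of a multi-byte character are still incomplete, so the loop holds back output until the character can be fully decoded.)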
Did not influence the performance of
```python
output_tokens = prev_tokens + new_tokens
prev_tokens += new_tokens
# ...
prefix_text = self._convert_tokens_to_string_with_added_encoders(
```
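For context, these hunk lines belong to the prefix-diff detokenization pattern: re-decode a window of already-emitted tokens together with the new ones, and emit only the difference. A minimal sketch under assumed names (`prefix_offset` and `read_offset` follow the vLLM convention; this is not lmdeploy's exact code):

```python
def decode_delta(tokenizer, output_tokens, prefix_offset, read_offset):
    """Decode only the text added since the last call.

    Re-decoding a window of earlier tokens together with the new ones,
    then slicing off the already-emitted prefix, preserves white spaces
    that per-token decoding would swallow.
    """
    prefix_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:read_offset])
    new_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:])
    return new_text[len(prefix_text):]
```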
I looked at vLLM's implementation. It has a branch:

```python
if tokenizer.is_fast or not tokenizer.get_added_vocab():
    prefix_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:read_offset])
    new_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:])
```

But we don't have that branch here. What is the reason?
It has been moved into the `_convert_tokens_to_string_with_added_encoders` function.
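For concreteness, a sketch of how the branch can live inside the helper, modeled on vLLM's function of the same name (illustrative; not necessarily lmdeploy's exact code):

```python
def _convert_tokens_to_string_with_added_encoders(tokenizer, output_tokens,
                                                  skip_special_tokens=False):
    # Fast tokenizers, and tokenizers without added vocabulary, can use
    # the built-in conversion directly; this is the branch that vLLM
    # keeps at the call site, folded into the helper here.
    if tokenizer.is_fast or not tokenizer.get_added_vocab():
        return tokenizer.convert_tokens_to_string(output_tokens)

    # Otherwise added tokens must be stitched in by hand, because
    # convert_tokens_to_string does not handle them correctly.
    sub_texts = []
    current_sub_text = []
    for token in output_tokens:
        if skip_special_tokens and token in tokenizer.all_special_tokens:
            continue
        if token in tokenizer.get_added_vocab():
            if current_sub_text:
                sub_texts.append(
                    tokenizer.convert_tokens_to_string(current_sub_text))
                current_sub_text = []
            sub_texts.append(token)
        else:
            current_sub_text.append(token)
    if current_sub_text:
        sub_texts.append(tokenizer.convert_tokens_to_string(current_sub_text))
    return ' '.join(sub_texts)
```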
The previous detokenization cannot handle many consecutive empty tokens for `LlamaTokenizer`, e.g. with the `upstage/SOLAR-0-70b-16bit` model. In that case the incrementally decoded output may lack a white space. This PR does not break backward compatibility, since the fix is implemented in a new function named `detokenize_incrementally`. However, full tests are still required.
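As a review aid, this is how an incremental detokenizer of this shape is typically driven. The loop below is a hypothetical usage sketch that only illustrates the contract (re-decode a small window, emit only completed text, carry `prefix_offset`/`read_offset` across calls); the actual `detokenize_incrementally` signature may differ:

```python
from transformers import AutoTokenizer

def stream_decode(model_path, token_ids):
    """Hypothetical driver illustrating incremental detokenization."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    tokens = tokenizer.convert_ids_to_tokens(token_ids)
    prefix_offset, read_offset = 0, 0
    text = ''
    for i in range(1, len(tokens) + 1):
        # Decode the prefix window and the window extended by the new
        # token; the string difference is the newly completed text.
        prefix_text = tokenizer.convert_tokens_to_string(
            tokens[prefix_offset:read_offset])
        new_text = tokenizer.convert_tokens_to_string(tokens[prefix_offset:i])
        # Hold back output while a multi-byte character is incomplete.
        if len(new_text) > len(prefix_text) and not new_text.endswith('�'):
            text += new_text[len(prefix_text):]
            prefix_offset, read_offset = read_offset, i
    return text
```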