System Info
- transformers version: 4.35.2
- Platform: Linux-6.1.58+-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.19.4
- Safetensors version: 0.4.1
- Accelerate version: 0.25.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu121 (False)
- Tensorflow version (GPU?): 2.15.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
- Jax version: 0.4.23
- JaxLib version: 0.4.23
- Using GPU in script?: no
- Using distributed or parallel set-up in script?: no
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang = "eng_Latn", tgt_lang = "zho_Hans")
tokenizer.add_tokens(["abcd"])
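
# Case 1: the added token "abcd" is immediately followed by other characters (no space after it)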
sent = 'I like to walk abcdgym along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))
sent = 'I like to walk gymabcd along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))Expected behavior
The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5
For context, I am actually trying to add Chinese tokens to the tokenizer, but for illustration purposes I have demonstrated the “bug” in English. Since Chinese words are not separated by spaces, the example adds a token that occurs as a subword, i.e. without surrounding spaces.
Evidently, tokenizer.add_tokens() works well when the added token is always followed by a space, but it does not work as intended when there is no space next to the added token; in that case the tokenizer introduces an additional space on its own.
From reading the docs, I gather this is probably because added tokens are isolated (split out) before the underlying tokenization algorithm is applied, so I am not 100% sure whether this behaviour by the tokenizer is intended.
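
In case it helps triage, below is a sketch of the variant I would try next: passing the token as an AddedToken so the matching flags (single_word, lstrip, rstrip, normalized) are explicit. I am assuming these are the relevant knobs; I have not confirmed that any combination of them avoids the extra space being introduced.

```python
from transformers import AddedToken, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="zho_Hans")

# Add the token with explicit matching flags instead of a plain string.
# These flag values are an assumption about the relevant knobs, not a confirmed fix.
tokenizer.add_tokens([AddedToken("abcd", single_word=False, lstrip=False, rstrip=False, normalized=True)])

sent = 'I like to walk abcdgym along the beach'
print("tokenizer: ", tokenizer.tokenize(sent))
print("tokenizer: ", tokenizer.decode(tokenizer.encode(sent)[1:-1]))
```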
