Closed
Labels: Core: Tokenization (Internals of the library; Tokenization), Good Second Issue (Issues that are more difficult to do than "Good First" issues - give it a try if you want!)
Description
System Info
- transformers version: 4.39.0.dev0
- Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
- Python version: 3.8.18
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.27.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.2+cu121 (True)
- Tensorflow version (GPU?): 2.13.1 (True)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?: no need
- Using distributed or parallel set-up in script?: no need
Who can help?
@ArthurZucker
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Code
from transformers import AutoTokenizer

# Slow (pure-Python) BERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
int_single_id = tokenizer.vocab_size - 1     # last vocab id, as a bare int
list_single_id = [tokenizer.vocab_size - 1]  # the same id, wrapped in a list
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Same check with another slow tokenizer that shares this code path.
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", use_fast=False)
int_single_id = tokenizer.vocab_size - 1
list_single_id = [tokenizer.vocab_size - 1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')
# Roughly estimated, around 15 models would have this issue.
Output
<<<<# # ~>>>>
<<<<##~>>>>
<<<<# # ~>>>>
<<<<##~>>>>
Expected behavior
Consistent behavior: when decoding the single ID, the output should also be ##~.
Suspected rationale: in src/transformers/tokenization_utils.py, the _decode function incorrectly applies spaces_between_special_tokens and ends up inserting spaces between the pieces of the sub-token.
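If that is the cause, the inconsistency should be visible without calling decode at all: for an int input, convert_ids_to_tokens returns a plain str rather than a list of tokens, so iterating over it yields individual characters. A minimal sketch of the suspected mechanism, plus a workaround (the exact token string depends on the vocab; '##~' is what the last bert-base-uncased id maps to in the output above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
single_id = tokenizer.vocab_size - 1

# For an int, convert_ids_to_tokens returns a single token string, not a list.
token = tokenizer.convert_ids_to_tokens(single_id)
print(repr(token))      # e.g. '##~'

# Iterating over that str yields characters; joining them with spaces,
# as spaces_between_special_tokens=True does, matches the '# # ~' output.
print(" ".join(token))

# Workaround until this is fixed: wrap the single id in a list.
print(tokenizer.decode([single_id]))  # '##~'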