[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID #29489

@Ki-Seki

Description

System Info

  • transformers version: 4.39.0.dev0
  • Platform: Linux-5.4.0-163-generic-x86_64-with-glibc2.10
  • Python version: 3.8.18
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: no need
  • Using distributed or parallel set-up in script?: no need

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Code

from transformers import AutoTokenizer

# Slow (Python) BERT tokenizer: decode a bare int ID vs. a one-element list of the same ID.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
int_single_id = tokenizer.vocab_size - 1
list_single_id = [tokenizer.vocab_size - 1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Same check with another slow tokenizer that shares the BERT vocabulary.
tokenizer = AutoTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base", use_fast=False)
int_single_id = tokenizer.vocab_size - 1
list_single_id = [tokenizer.vocab_size - 1]
print(f'<<<<{tokenizer.decode(int_single_id)}>>>>')
print(f'<<<<{tokenizer.decode(list_single_id)}>>>>')

# Rough estimate: around 15 tokenizers in the library show this issue.

Output

<<<<# # ~>>>>
<<<<##~>>>>
<<<<# # ~>>>>
<<<<##~>>>>

Expected behavior

Consistent behavior: decoding the single ID should produce the same output as decoding the one-element list, e.g. ##~ in both cases.

Suspected cause: in src/transformers/tokenization_utils.py, the _decode function incorrectly takes the spaces_between_special_tokens path when given a single int and then joins the sub-tokens with spaces.
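
For illustration, here is a minimal sketch (not the library's actual implementation) of how the extra spaces could arise, together with a workaround that wraps a bare ID in a list before decoding. It reuses the bert-base-uncased slow tokenizer from the reproduction above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
token_id = tokenizer.vocab_size - 1  # corresponds to the token "##~"

# Hypothesis: with a bare int, convert_ids_to_tokens returns the token string
# itself rather than a list, so joining its elements with spaces splits the
# token character by character -- which matches the observed output.
token = tokenizer.convert_ids_to_tokens(token_id)  # "##~" (a str, not a list)
print(" ".join(token))                             # "# # ~"

# Workaround until this is fixed: always pass a list to decode.
print(tokenizer.decode([token_id]))                # "##~"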

Labels

Core: Tokenization
Good Second Issue
