Inconsistent special_token addition in EncoderDecoderModel forward pass #31729

emergenz commented Jul 1, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-4.18.0-513.11.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.11.9
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add a breakpoint at:

  2. Run the example script of EncoderDecoderModel

    >>> from transformers import EncoderDecoderModel, BertTokenizer
    >>> import torch
    >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    >>> model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    ... "google-bert/bert-base-uncased", "google-bert/bert-base-uncased"
    ... ) # initialize Bert2Bert from pre-trained checkpoints
    >>> # training
    >>> model.config.decoder_start_token_id = tokenizer.cls_token_id
    >>> model.config.pad_token_id = tokenizer.pad_token_id
    >>> model.config.vocab_size = model.config.decoder.vocab_size
    >>> input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
    >>> labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
    >>> outputs = model(input_ids=input_ids, labels=labels)

  3. Examine the content of decoder_input_ids. You will see that the bos_token has been added twice.
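For reference, the doubled token can also be reproduced without a debugger. The sketch below assumes the shift_tokens_right helper in transformers.models.encoder_decoder.modeling_encoder_decoder, which the forward pass uses to build decoder_input_ids from labels when they are not passed explicitly; it shows that the labels already begin with [CLS] and that the shift prepends decoder_start_token_id a second time:

    from transformers import BertTokenizer
    from transformers.models.encoder_decoder.modeling_encoder_decoder import shift_tokens_right

    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
    print(labels[0, 0] == tokenizer.cls_token_id)  # True: BertTokenizer already prepends [CLS]
    decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id, tokenizer.cls_token_id)
    print(decoder_input_ids[0, :2])                # [CLS] appears twice: tensor([101, 101])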

Expected behavior

The bos_token should only be prepended once to the input to the decoder.

The bug occurs because BertTokenizer already wraps the text in a bos_token and an eos_token. Then, during the EncoderDecoderModel forward pass (specifically when the decoder_input_ids are constructed from the labels), another bos_token is prepended.

This means that, to use EncoderDecoderModel correctly, one has to tokenize the decoder inputs so that an eos_token is appended to the text but a bos_token is NOT prepended. This is funky and clearly not the intended behavior; a sketch of the workaround follows below.
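A minimal sketch of that workaround (not something the docs recommend, just what currently avoids the double bos_token): tokenize the labels with add_special_tokens=False and append only the [SEP]/eos token by hand, leaving the bos_token for the forward pass to add:

    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    labels = tokenizer(
        "This is the corresponding summary", add_special_tokens=False, return_tensors="pt"
    ).input_ids
    # Append only [SEP] (the eos here); do NOT prepend [CLS], since the forward pass
    # will prepend decoder_start_token_id itself.
    labels = torch.cat([labels, torch.tensor([[tokenizer.sep_token_id]])], dim=-1)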

The decoder part of an EncoderDecoderModel is rarely initialized from an encoder-only Transformer. Since the tokenizers of decoder-only Transformers add neither bos_tokens nor eos_tokens, I think the correct way to change the behavior of EncoderDecoderModel would be to add both the bos_token and the eos_token in the model forward pass. To clear up the confusion, the official example should also be changed to construct an EncoderDecoderModel from an encoder-only Transformer and a decoder-only Transformer.
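For illustration, a sketch of the kind of example I have in mind; the decoder checkpoint is only an assumption (openai-community/gpt2 here), and the pad/start token choices are common conventions rather than settled recommendations:

    from transformers import AutoTokenizer, EncoderDecoderModel

    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "google-bert/bert-base-uncased", "openai-community/gpt2"
    )
    enc_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
    dec_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
    # The GPT-2 tokenizer adds no special tokens, so bos/eos handling falls to the model.
    model.config.decoder_start_token_id = dec_tokenizer.bos_token_id
    model.config.pad_token_id = dec_tokenizer.eos_token_id  # GPT-2 has no pad token; reuse eos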
