Inconsistent special_token addition in EncoderDecoderModel forward pass #31729

emergenz commented Jul 1, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-4.18.0-513.11.1.el8_9.x86_64-x86_64-with-glibc2.28
  • Python version: 3.11.9
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.3
  • Accelerate version: 0.31.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Add a breakpoint at:

  2. Run the example script of EncoderDecoderModel

    >>> from transformers import EncoderDecoderModel, BertTokenizer
    >>> import torch
    >>> tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    >>> model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    ... "google-bert/bert-base-uncased", "google-bert/bert-base-uncased"
    ... ) # initialize Bert2Bert from pre-trained checkpoints
    >>> # training
    >>> model.config.decoder_start_token_id = tokenizer.cls_token_id
    >>> model.config.pad_token_id = tokenizer.pad_token_id
    >>> model.config.vocab_size = model.config.decoder.vocab_size
    >>> input_ids = tokenizer("This is a really long text", return_tensors="pt").input_ids
    >>> labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
    >>> outputs = model(input_ids=input_ids, labels=labels)

  3. Examine the content of decoder_input_ids. You will see that the bos_token has been added twice.
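For reference, the doubled token can also be reproduced without a debugger. The sketch below assumes the shift_tokens_right helper in transformers.models.encoder_decoder.modeling_encoder_decoder, which the forward pass uses to build decoder_input_ids from labels when they are not passed explicitly; it shows that the labels already begin with [CLS] and that the shift prepends decoder_start_token_id a second time:

    from transformers import BertTokenizer
    from transformers.models.encoder_decoder.modeling_encoder_decoder import shift_tokens_right

    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
    print(labels[0, 0] == tokenizer.cls_token_id)  # True: BertTokenizer already prepends [CLS]
    decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id, tokenizer.cls_token_id)
    print(decoder_input_ids[0, :2])                # [CLS] appears twice: tensor([101, 101])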

Expected behavior

The bos_token should only be prepended once to the input to the decoder.

The bug occurs because BertTokenizer already wraps the text in a bos_token and an eos_token. Then, during the EncoderDecoderModel forward pass (specifically when the decoder_input_ids are constructed from the labels), another bos_token is prepended.

This means that, to use EncoderDecoderModel correctly, one has to tokenize the decoder inputs so that an eos_token is appended to the text but a bos_token is NOT prepended. This is funky and clearly not the intended behavior; a sketch of the workaround follows below.
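A minimal sketch of that workaround (not something the docs recommend, just what currently avoids the double bos_token): tokenize the labels with add_special_tokens=False and append only the [SEP]/eos token by hand, leaving the bos_token for the forward pass to add:

    import torch
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
    labels = tokenizer(
        "This is the corresponding summary", add_special_tokens=False, return_tensors="pt"
    ).input_ids
    # Append only [SEP] (the eos here); do NOT prepend [CLS], since the forward pass
    # will prepend decoder_start_token_id itself.
    labels = torch.cat([labels, torch.tensor([[tokenizer.sep_token_id]])], dim=-1)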

The decoder part of an EncoderDecoderModel is rarely initialized from an encoder-only Transformer. Since the tokenizers of decoder-only Transformers add neither bos_tokens nor eos_tokens, I think the correct way to change the behavior of EncoderDecoderModel would be to add both the bos_token and the eos_token in the model forward pass. To clear up the confusion, the official example should also be changed to construct an EncoderDecoderModel from an encoder-only Transformer and a decoder-only Transformer.
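For illustration, a sketch of the kind of example I have in mind; the decoder checkpoint is only an assumption (openai-community/gpt2 here), and the pad/start token choices are common conventions rather than settled recommendations:

    from transformers import AutoTokenizer, EncoderDecoderModel

    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        "google-bert/bert-base-uncased", "openai-community/gpt2"
    )
    enc_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
    dec_tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
    # The GPT-2 tokenizer adds no special tokens, so bos/eos handling falls to the model.
    model.config.decoder_start_token_id = dec_tokenizer.bos_token_id
    model.config.pad_token_id = dec_tokenizer.eos_token_id  # GPT-2 has no pad token; reuse eos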
