[BLIP] fix cross attentions for BlipTextEncoder #22515
cc @ArthurZucker and @younesbelkada
younesbelkada
left a comment
Thanks for the PR! Could you share more details about the bug you have encountered (i.e. how to reproduce it) and how this PR is relevant to fix that bug?
Sure, happy to provide more details. The bug is that the all_cross_attentions variable never stores the cross-attention produced by each BlipTextLayer. The variable is initialized at line 404 and returned at either line 460 or line 469, but it is never updated in between. As a result, the forward function always returns an empty tuple for the cross-attentions. This PR changes the forward function so that all_cross_attentions accumulates the cross-attention produced by each BlipTextLayer and the correct cross-attentions are returned.
To reproduce the bug, run the following snippet (the returned cross-attentions will always be an empty tuple):
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base").to("cuda")
model.text_encoder.config.output_attentions = True
img_path = "path of an image"
raw_image = Image.open(img_path).convert('RGB')
name = "cat"
question = [
    "Is there a {} in the view?".format(name),
]
inputs = processor([raw_image]*len(question), question, padding=True, return_tensors="pt").to("cuda")
vision_outputs = model.vision_model(inputs['pixel_values'])
image_embeds = vision_outputs[0]
image_attention_mask = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image_embeds.device)
question_outputs = model.text_encoder(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_attention_mask,
    return_dict=True,
)
# question_outputs['cross_attentions'] will always be an empty tuple
print(question_outputs['cross_attentions'])
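The failure mode described above can be illustrated without loading the model. The following is a minimal sketch, not the transformers source: `toy_layer`, `encode_buggy`, and `encode_fixed` are hypothetical names, and the "attentions" are plain strings standing in for tensors. It only shows why an accumulator that is initialized but never updated comes back as an empty tuple.

```python
# Toy stand-in for an encoder loop. All names here are hypothetical and
# chosen for illustration; this is not the BlipTextEncoder implementation.

def toy_layer(hidden, index):
    """Pretend layer: returns (new_hidden, self_attention, cross_attention)."""
    return hidden + 1, f"self_attn_{index}", f"cross_attn_{index}"


def encode_buggy(hidden, num_layers=3, output_attentions=True):
    all_self_attentions = () if output_attentions else None
    all_cross_attentions = () if output_attentions else None
    for i in range(num_layers):
        hidden, self_attn, cross_attn = toy_layer(hidden, i)
        if output_attentions:
            all_self_attentions = all_self_attentions + (self_attn,)
            # Bug pattern: cross_attn is never appended, so the tuple stays empty.
    return hidden, all_self_attentions, all_cross_attentions


def encode_fixed(hidden, num_layers=3, output_attentions=True):
    all_self_attentions = () if output_attentions else None
    all_cross_attentions = () if output_attentions else None
    for i in range(num_layers):
        hidden, self_attn, cross_attn = toy_layer(hidden, i)
        if output_attentions:
            all_self_attentions = all_self_attentions + (self_attn,)
            # Fix pattern: accumulate the per-layer cross-attention as well.
            all_cross_attentions = all_cross_attentions + (cross_attn,)
    return hidden, all_self_attentions, all_cross_attentions


print(encode_buggy(0)[2])
print(encode_fixed(0)[2])
```

In this sketch the buggy variant returns one self-attention entry per layer but no cross-attention entries, which mirrors the symptom reported for the real model.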
Thanks a lot for fixing and clarifying! Your fix is the right fix.
This was never caught by the CI because, to check that cross_attention_output exists, the model tester needs an is_encoder_decoder attribute set to True. This can be addressed in a follow-up PR.
Thanks for your contribution!
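The reason the CI missed the bug can be sketched as a check guarded by a flag. This is a rough illustration under stated assumptions, not the actual transformers test code: `ToyModelTester` and `check_attention_outputs` are hypothetical, and only the `is_encoder_decoder` attribute name comes from the comment above.

```python
class ToyModelTester:
    """Hypothetical tester; only the is_encoder_decoder flag mirrors the comment."""

    def __init__(self, is_encoder_decoder=False):
        self.is_encoder_decoder = is_encoder_decoder


def check_attention_outputs(tester, outputs):
    """Run the attention checks and return the names of the checks that ran."""
    ran = []
    assert len(outputs["attentions"]) > 0
    ran.append("attentions")
    # The cross-attention check only runs when the tester opts in via the flag.
    if getattr(tester, "is_encoder_decoder", False):
        assert len(outputs["cross_attentions"]) > 0
        ran.append("cross_attentions")
    return ran


# Outputs shaped like the buggy case: self-attentions present, cross-attentions empty.
buggy_outputs = {"attentions": ("a0",), "cross_attentions": ()}

# With the flag unset, the empty cross-attentions go unnoticed.
print(check_attention_outputs(ToyModelTester(False), buggy_outputs))
```

With the flag set to True, the same buggy outputs would fail the guarded assertion, which is the kind of coverage the follow-up PR mentioned above could add.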
What does this PR do?
Fixes a bug in BlipTextEncoder where the returned cross-attentions were always an empty tuple.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.