Unable to use BLIP2 with caption_coco_opt6.7b at HEAD via salesforce-lavis (also HEAD) #21713

@AstraliteHeart

Description

System Info

working:

  • transformers version: 4.26.1
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

broken:

  • transformers version: 4.27.0.dev0
  • Platform: Linux-6.0.12-x86_64-with-glibc2.10
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.13.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@gante @NielsRogge

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Start with a clean env set up via https://github.com/salesforce/LAVIS/blob/main/requirements.txt (this installs transformers 4.26.1)
  2. Run python test_simple.py; the model loads correctly and prints a caption
  3. Run pip install --upgrade git+https://github.com/huggingface/transformers (I wanted the shiny new BLIP-2 conversion script so I can convert my finetuned model to the HF format)
  4. This resolved https://github.com/huggingface/transformers to commit 8b3db33a763ccef828fca89bac7e6cbff314f131
  5. Run python test_simple.py again
  6. RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.
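
test_simple.py (the script run in steps 2 and 5):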
import requests
import torch
from PIL import Image

from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the finetuned COCO captioning checkpoint through LAVIS.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device
)

url = "..."
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Fails on transformers 4.27.0.dev0 with the RuntimeError from step 6.
data = model.generate({"image": image})
print(data)
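
For what it's worth, LAVIS's captioning generate() defaults to num_beams=5, and 25 = 5 × 5, so the mismatch looks like a beam-expanded image attention mask being concatenated with a text mask that was expanded a second time; that would point at a change in how generate() expands its inputs between 4.26.1 and 4.27.0.dev0. This is only a guess from the shapes, though.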

Expected behavior

BLIP-2 models loaded through salesforce-lavis should keep working with the latest HF transformers.
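
For reference, the HF-native BLIP-2 API that landed in 4.27 can produce a caption directly. Below is a minimal sketch, assuming the public Salesforce/blip2-opt-2.7b checkpoint as a stand-in for the finetuned caption_coco_opt6.7b weights (which would first have to go through the conversion script mentioned in step 3):

import requests
import torch
from PIL import Image

from transformers import Blip2ForConditionalGeneration, Blip2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Public checkpoint used as a stand-in for the finetuned model.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

url = "..."  # any test image URL
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# The processor turns the PIL image into pixel_values for the vision encoder.
inputs = processor(images=raw_image, return_tensors="pt").to(device)
out = model.generate(**inputs)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())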
