Add ColQwen2 to 🤗 transformers #35778
@@ -0,0 +1,166 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# ColQwen2

[ColQwen2](https://doi.org/10.48550/arXiv.2407.01449) is a variant of the [ColPali](./colpali) model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the [Qwen2-VL](./qwen2_vl) backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

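To make the late interaction idea concrete, here is a minimal sketch with random tensors standing in for ColQwen2's multi-vector embeddings (the shapes below are illustrative assumptions, not the model's actual dimensions):

```python
import torch

# Illustrative shapes only: one query with 12 token embeddings and one page
# with 700 patch embeddings, both of dimension 128.
query_emb = torch.randn(12, 128)
page_emb = torch.randn(700, 128)

# Late interaction (MaxSim): for each query token, keep its best-matching
# page patch, then sum those maxima to get the query-page relevance score.
pairwise_sim = query_emb @ page_emb.T          # shape [12, 700]
score = pairwise_sim.max(dim=1).values.sum()   # scalar relevance score
print(f"late-interaction score: {score.item():.3f}")
```

Ranking a corpus then amounts to computing this score between the query and every page and sorting the results.
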
This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).

You can find all the original ColPali checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.

> [!TIP]
> Click on the ColQwen2 models in the right sidebar for more examples of how to use ColQwen2 for image retrieval.

<hfoptions id="usage">
<hfoption id="image retrieval">

```python
import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available


model_name = "vidore/colqwen2-v1.0-hf"

# Load model
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
)
```

Suggested change:

```diff
-    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
-)
+    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
+)
```

Agreed! It's been addressed 👌🏼
Actually, it seems SDPA doesn't work out of the box for ColQwen2, as I get this error when loading the model on MPS.

❌ Code:

```python
model_name = "vidore/colqwen2-v1.0-hf"

# Load model
model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
```

Note: Leaving `attn_implementation=None` works.

The error:

```
ValueError: ColQwen2ForRetrieval does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument `attn_implementation="eager"` meanwhile. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`
```

✅ However, I managed to load Qwen2VL with SDPA:

```python
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
```

@Cyrilvallez @yonigozlan I read the instructions for enabling SDPA on ColQwen2, but the next steps are a bit unclear since ColQwen2 essentially piggybacks on Qwen2VL thanks to modular. Any ideas about the right fix? 🤗

I believe it's only because the flags are not set in the PreTrainedModel - adding

```python
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_flex_attn = True
_supports_cache_class = True
```

should solve it.

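For context, a minimal sketch of where such flags live, assuming the pretrained base class is named `ColQwen2PreTrainedModel` and using `Qwen2VLConfig` as a stand-in for the model's own config class:

```python
from transformers import PreTrainedModel, Qwen2VLConfig


class ColQwen2PreTrainedModel(PreTrainedModel):
    """Sketch only: class attributes opting the model into alternative attention backends."""

    config_class = Qwen2VLConfig  # stand-in; the real class would point to ColQwen2's config
    _supports_flash_attn_2 = True
    _supports_sdpa = True
    _supports_flex_attn = True
    _supports_cache_class = True
```

With these class attributes set, loading with `attn_implementation="sdpa"` should no longer raise the ValueError above.
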
Tsm, the fix is working like a charm! And as you expected, ColQwen2 works with `attn_implementation="flex_attention"` too 👌🏼
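
For reference, a minimal sketch of that load (model ID and dtype taken from the snippet above; FlexAttention requires a recent PyTorch build):

```python
import torch
from transformers import ColQwen2ForRetrieval

# Sketch: load ColQwen2 with the FlexAttention backend once the support flags are set.
model = ColQwen2ForRetrieval.from_pretrained(
    "vidore/colqwen2-v1.0-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flex_attention",
)
```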