Add ColQwen2 to 🤗 transformers #35778
Merged
Commits (68)
- `551e46c` feat: add colqwen2 (wip) (tonywu71)
- `154d843` tests: fix test_attention_outputs (tonywu71)
- `055eb5e` tests: reduce hidden size to accelerate tests (tonywu71)
- `c0c7248` tests: fix `test_attention_outputs` 🥳 (tonywu71)
- `0a1e9f0` fix: fix wrong parent class for `ColQwen2ForRetrievalOutput` (tonywu71)
- `99a5961` fix: minor typing and style changes (tonywu71)
- `c731365` chore: run `make style` (tonywu71)
- `5985784` feat: remove redundant `max_num_visual_tokens` attribute in `ColQwen2… (tonywu71)
- `c6567d4` tests: tweak comments (tonywu71)
- `0cb74d9` style: apply ruff formatter (tonywu71)
- `6109920` feat: move default values for `visual_prompt_prefix` and `query_prefix` (tonywu71)
- `b090847` docs: update ColQwen2 model card (tonywu71)
- `607cd78` docs: tweak model cards (tonywu71)
- `6c261cf` docs: add required example config checkpoint (tonywu71)
- `b027a9d` tests: update expected scores in integration test (tonywu71)
- `0302b12` docs: tweak quickstart snippets (tonywu71)
- `5eaa32b` fix: address PR comments (tonywu71)
- `ebb89b5` tests: fix colqwen2 tests + tweak comment in colpali test (tonywu71)
- `bdbaa2b` tests: unskip useful tests (tonywu71)
- `6fbac2a` fix: fix bug when `visual_prompt_prefix` or `query_prefix` is an empt… (tonywu71)
- `6931500` fix: fix ColPali outputs when `return_dict == False` (tonywu71)
- `985575c` fix: fix issue with PaliGemma output not being a dict (tonywu71)
- `68ba7b8` docs: set default dtype to bfloat16 in quickstart snippets (tonywu71)
- `bae3119` fix: fix error when `return_dict=False` in ColPali and ColQwen2 (tonywu71)
- `7dcc1e0` tests: fix special tokens not being replaced in input_ids (tonywu71)
- `17882c2` style: fix lint (tonywu71)
- `da93dcf` fix: `ColQwen2Processor`'s `padding_side` is now set from `processor_… (tonywu71)
- `2b1ef88` fix: remove unused `padding_side` in ColQwen2 model (tonywu71)
- `60d4033` docs: update ColQwen2's model doc (tonywu71)
- `bb27ef9` fix: fix harcoded vlm backbone class in ColQwen2Config (tonywu71)
- `a31b2f3` fix: remove `padding_side` from ColQwen2Processor as should fed from … (tonywu71)
- `45fba97` docs: fix typo in model docstring (tonywu71)
- `78d051d` docs: add illuin mention in model docs (tonywu71)
- `ee9800b` fix: let `padding_size` be handled by `tokenizer_config.json` (tonywu71)
- `4f76803` docs: add colpali reference url in colqwen2's model doc (tonywu71)
- `cb924e3` docs: add Hf mention in model docs (tonywu71)
- `f8f8261` docs: add late interaction mention in model docs (tonywu71)
- `824b331` docs: tweak colqwen2 model doc (tonywu71)
- `6dc1d22` docs: update reference checkpoint for ColPali to v1.3 (tonywu71)
- `d325c01` docs: simplify quickstart snippets (tonywu71)
- `61a578c` docs: remove redundant `.eval()` (tonywu71)
- `ff59eb2` refactor: use `can_return_tuple` decorator for ColPali and ColQwen2 (tonywu71)
- `f48568b` docs: fix copyright date (tonywu71)
- `45d1dbe` docs: add missing copyright in tests (tonywu71)
- `7b0f900` fix: raise error when `initializer_range` is not in config (tonywu71)
- `f171ed6` docs: remove redundant `.eval()` in colpali doc (tonywu71)
- `eaa797b` fix: fix `get_text_config` now that Qwen2VL has a proper `text_config… (tonywu71)
- `c8e360f` fix: add missing `initializer_range` attribute in `ColQwen2Config` (tonywu71)
- `14d7b5c` fix: use `get_text_config` in `resize_token_embeddings` (tonywu71)
- `0686b2a` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `10b3ddb` update colwen2 with auto_docstring (yonigozlan)
- `bdef63f` docs: fix wrong copyright year (tonywu71)
- `4b7f635` chore: remove `raise` as `initializer_range` has a default value in `… (tonywu71)
- `c638c07` refactor: merge `inner_forward` into `forward` (tonywu71)
- `30d2080` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `8277c43` Refactor colqwen2 after refactoring of qwen2VL, use modular for model… (yonigozlan)
- `86e0693` protect torch import in modular to protect in processing (yonigozlan)
- `c0a6442` protect torch import in modular to protect in processing (yonigozlan)
- `98a5338` Merge branch 'add-colqwen2' of https://github.com/tonywu71/transforme… (yonigozlan)
- `4aa5aa0` tests: fix hf model path in ColQwen2 integration test (tonywu71)
- `34ca1e7` docs: clarify `attn_implementation` and add comments (tonywu71)
- `43af0ad` docs: add fallback snippet for using offline PIL dummy images (tonywu71)
- `0356f3c` docs: temporarily revert attn_implementation to `None` while sdpa is … (tonywu71)
- `7a4218b` docs: tweaks in colpali/colqwen2 quick start snippets (tonywu71)
- `58c7ff2` fix: add missing flags to enable SDPA/Flex Attention in ColQwen2 model (tonywu71)
- `3852c86` fix: add missing changes in modular file (tonywu71)
- `bd65ad3` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `1bc3dea` fix modeling tests (yonigozlan)
@@ -0,0 +1,176 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# ColQwen2

[ColQwen2](https://doi.org/10.48550/arXiv.2407.01449) is a variant of the [ColPali](./colpali) model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the [Qwen2-VL](./qwen2_vl) backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
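
To make the late interaction step concrete, the snippet below is a minimal, illustrative sketch of ColBERT-style MaxSim scoring between a single query and a single page: each query token embedding is matched against its most similar image token embedding, and the per-token maxima are summed. The `late_interaction_score` helper and the tensor shapes are assumptions for illustration only, not the exact implementation behind [`~ColQwen2Processor.score_retrieval`].

```python
import torch


def late_interaction_score(query_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """MaxSim between one query (n_query_tokens, dim) and one page (n_image_tokens, dim)."""
    # Pairwise similarity between every query token and every image token
    similarity = query_tokens @ image_tokens.T  # (n_query_tokens, n_image_tokens)
    # Keep the best-matching image token per query token, then sum over query tokens
    return similarity.max(dim=1).values.sum()


# Toy example with random multi-vector embeddings of dimension 128
print(late_interaction_score(torch.randn(20, 128), torch.randn(750, 128)))
```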

This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).

You can find all the original ColQwen2 checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.

> [!TIP]
> Click on the ColQwen2 models in the right sidebar for more examples of how to use ColQwen2 for image retrieval.

<hfoptions id="usage">
<hfoption id="image retrieval">

```python
import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available


# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```

If you have issues loading the images with PIL, you can use the following code to create dummy images:

```python
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4.

```python
import requests
import torch
from PIL import Image

from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor


model_name = "vidore/colqwen2-v1.0-hf"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```
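
As an optional sanity check that is not part of the original example, you can print the model's approximate memory footprint as reported by Transformers to confirm the 4-bit weights were loaded; the exact number varies by device and library version.

```python
# Approximate in-memory size of the quantized model (reported by transformers)
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
```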

## Notes

- [`~ColQwen2Processor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image (see the sketch after this list for ranking pages with these scores).
- Unlike ColPali, ColQwen2 supports arbitrary image resolutions and aspect ratios, which means images are not resized into fixed-size squares. This preserves more of the original input signal.
- Larger input images generate longer multi-vector embeddings, allowing users to adjust image resolution to balance performance and memory usage.
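
Building on the quickstart snippets above, here is a small, illustrative sketch of ranking pages per query with that score matrix. It assumes the `scores` and `queries` variables from the earlier snippets are in scope; the value of `k` is an arbitrary choice.

```python
# Rank the page images for each query using the (num_queries, num_images) score matrix
k = 1  # number of pages to keep per query (arbitrary)
top_scores, top_indices = scores.topk(k, dim=1)

for query, page_indices, page_scores in zip(queries, top_indices.tolist(), top_scores.tolist()):
    print(f"{query!r} -> page index {page_indices[0]} (score {page_scores[0]:.2f})")
```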

## ColQwen2Config

[[autodoc]] ColQwen2Config

## ColQwen2Processor

[[autodoc]] ColQwen2Processor

## ColQwen2ForRetrieval

[[autodoc]] ColQwen2ForRetrieval
    - forward