Add ColQwen2 to 🤗 transformers #35778
Merged
Commits (68)
- `551e46c` feat: add colqwen2 (wip) (tonywu71)
- `154d843` tests: fix test_attention_outputs (tonywu71)
- `055eb5e` tests: reduce hidden size to accelerate tests (tonywu71)
- `c0c7248` tests: fix `test_attention_outputs` 🥳 (tonywu71)
- `0a1e9f0` fix: fix wrong parent class for `ColQwen2ForRetrievalOutput` (tonywu71)
- `99a5961` fix: minor typing and style changes (tonywu71)
- `c731365` chore: run `make style` (tonywu71)
- `5985784` feat: remove redundant `max_num_visual_tokens` attribute in `ColQwen2… (tonywu71)
- `c6567d4` tests: tweak comments (tonywu71)
- `0cb74d9` style: apply ruff formatter (tonywu71)
- `6109920` feat: move default values for `visual_prompt_prefix` and `query_prefix` (tonywu71)
- `b090847` docs: update ColQwen2 model card (tonywu71)
- `607cd78` docs: tweak model cards (tonywu71)
- `6c261cf` docs: add required example config checkpoint (tonywu71)
- `b027a9d` tests: update expected scores in integration test (tonywu71)
- `0302b12` docs: tweak quickstart snippets (tonywu71)
- `5eaa32b` fix: address PR comments (tonywu71)
- `ebb89b5` tests: fix colqwen2 tests + tweak comment in colpali test (tonywu71)
- `bdbaa2b` tests: unskip useful tests (tonywu71)
- `6fbac2a` fix: fix bug when `visual_prompt_prefix` or `query_prefix` is an empt… (tonywu71)
- `6931500` fix: fix ColPali outputs when `return_dict == False` (tonywu71)
- `985575c` fix: fix issue with PaliGemma output not being a dict (tonywu71)
- `68ba7b8` docs: set default dtype to bfloat16 in quickstart snippets (tonywu71)
- `bae3119` fix: fix error when `return_dict=False` in ColPali and ColQwen2 (tonywu71)
- `7dcc1e0` tests: fix special tokens not being replaced in input_ids (tonywu71)
- `17882c2` style: fix lint (tonywu71)
- `da93dcf` fix: `ColQwen2Processor`'s `padding_side` is now set from `processor_… (tonywu71)
- `2b1ef88` fix: remove unused `padding_side` in ColQwen2 model (tonywu71)
- `60d4033` docs: update ColQwen2's model doc (tonywu71)
- `bb27ef9` fix: fix harcoded vlm backbone class in ColQwen2Config (tonywu71)
- `a31b2f3` fix: remove `padding_side` from ColQwen2Processor as should fed from … (tonywu71)
- `45fba97` docs: fix typo in model docstring (tonywu71)
- `78d051d` docs: add illuin mention in model docs (tonywu71)
- `ee9800b` fix: let `padding_size` be handled by `tokenizer_config.json` (tonywu71)
- `4f76803` docs: add colpali reference url in colqwen2's model doc (tonywu71)
- `cb924e3` docs: add Hf mention in model docs (tonywu71)
- `f8f8261` docs: add late interaction mention in model docs (tonywu71)
- `824b331` docs: tweak colqwen2 model doc (tonywu71)
- `6dc1d22` docs: update reference checkpoint for ColPali to v1.3 (tonywu71)
- `d325c01` docs: simplify quickstart snippets (tonywu71)
- `61a578c` docs: remove redundant `.eval()` (tonywu71)
- `ff59eb2` refactor: use `can_return_tuple` decorator for ColPali and ColQwen2 (tonywu71)
- `f48568b` docs: fix copyright date (tonywu71)
- `45d1dbe` docs: add missing copyright in tests (tonywu71)
- `7b0f900` fix: raise error when `initializer_range` is not in config (tonywu71)
- `f171ed6` docs: remove redundant `.eval()` in colpali doc (tonywu71)
- `eaa797b` fix: fix `get_text_config` now that Qwen2VL has a proper `text_config… (tonywu71)
- `c8e360f` fix: add missing `initializer_range` attribute in `ColQwen2Config` (tonywu71)
- `14d7b5c` fix: use `get_text_config` in `resize_token_embeddings` (tonywu71)
- `0686b2a` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `10b3ddb` update colwen2 with auto_docstring (yonigozlan)
- `bdef63f` docs: fix wrong copyright year (tonywu71)
- `4b7f635` chore: remove `raise` as `initializer_range` has a default value in `… (tonywu71)
- `c638c07` refactor: merge `inner_forward` into `forward` (tonywu71)
- `30d2080` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `8277c43` Refactor colqwen2 after refactoring of qwen2VL, use modular for model… (yonigozlan)
- `86e0693` protect torch import in modular to protect in processing (yonigozlan)
- `c0a6442` protect torch import in modular to protect in processing (yonigozlan)
- `98a5338` Merge branch 'add-colqwen2' of https://github.com/tonywu71/transforme… (yonigozlan)
- `4aa5aa0` tests: fix hf model path in ColQwen2 integration test (tonywu71)
- `34ca1e7` docs: clarify `attn_implementation` and add comments (tonywu71)
- `43af0ad` docs: add fallback snippet for using offline PIL dummy images (tonywu71)
- `0356f3c` docs: temporarily revert attn_implementation to `None` while sdpa is … (tonywu71)
- `7a4218b` docs: tweaks in colpali/colqwen2 quick start snippets (tonywu71)
- `58c7ff2` fix: add missing flags to enable SDPA/Flex Attention in ColQwen2 model (tonywu71)
- `3852c86` fix: add missing changes in modular file (tonywu71)
- `bd65ad3` Merge remote-tracking branch 'upstream/main' into add-colqwen2 (yonigozlan)
- `1bc3dea` fix modeling tests (yonigozlan)
@@ -0,0 +1,176 @@

<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>

# ColQwen2

[ColQwen2](https://doi.org/10.48550/arXiv.2407.01449) is a variant of the [ColPali](./colpali) model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the [Qwen2-VL](./qwen2_vl) backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
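
To make the late interaction step concrete, the snippet below is a minimal, illustrative sketch of ColBERT-style MaxSim scoring between a single query and a single page: each query token embedding is matched against its most similar image token embedding, and the per-token maxima are summed. The `late_interaction_score` helper and the tensor shapes are assumptions for illustration only, not the exact implementation behind [`~ColQwen2Processor.score_retrieval`].

```python
import torch


def late_interaction_score(query_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
    """MaxSim between one query (n_query_tokens, dim) and one page (n_image_tokens, dim)."""
    # Pairwise similarity between every query token and every image token
    similarity = query_tokens @ image_tokens.T  # (n_query_tokens, n_image_tokens)
    # Keep the best-matching image token per query token, then sum over query tokens
    return similarity.max(dim=1).values.sum()


# Toy example with random multi-vector embeddings of dimension 128
print(late_interaction_score(torch.randn(20, 128), torch.randn(750, 128)))
```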

This model was contributed by [@tonywu71](https://huggingface.co/tonywu71) (ILLUIN Technology) and [@yonigozlan](https://huggingface.co/yonigozlan) (HuggingFace).

You can find all the original ColQwen2 checkpoints under Vidore's [Hf-native ColVision Models](https://huggingface.co/collections/vidore/hf-native-colvision-models-6755d68fc60a8553acaa96f7) collection.

> [!TIP]
> Click on the ColQwen2 models in the right sidebar for more examples of how to use ColQwen2 for image retrieval.

<hfoptions id="usage">
<hfoption id="image retrieval">

```python
import requests
import torch
from PIL import Image

from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available


# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # "cpu", "cuda", or "mps" for Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)

# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

# The queries you want to retrieve documents for
queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```

If you have issues loading the images with PIL, you can use the following code to create dummy images:

```python
images = [
    Image.new("RGB", (128, 128), color="white"),
    Image.new("RGB", (64, 32), color="black"),
]
```

</hfoption>
</hfoptions>

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.

The example below uses [bitsandbytes](../quantization/bitsandbytes.md) to quantize the weights to int4.

```python
import requests
import torch
from PIL import Image

from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor


model_name = "vidore/colqwen2-v1.0-hf"

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = ColQwen2ForRetrieval.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="cuda",
).eval()

processor = ColQwen2Processor.from_pretrained(model_name)

url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"

images = [
    Image.open(requests.get(url1, stream=True).raw),
    Image.open(requests.get(url2, stream=True).raw),
]

queries = [
    "When was the United States Declaration of Independence proclaimed?",
    "Who printed the edition of Romeo and Juliet?",
]

# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**inputs_images).embeddings
    query_embeddings = model(**inputs_text).embeddings

# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)

print("Retrieval scores (query x image):")
print(scores)
```
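
As an optional sanity check that is not part of the original example, you can print the model's approximate memory footprint as reported by Transformers to confirm the 4-bit weights were loaded; the exact number varies by device and library version.

```python
# Approximate in-memory size of the quantized model (reported by transformers)
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
```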

## Notes

- [`~ColQwen2Processor.score_retrieval`] returns a 2D tensor where the first dimension is the number of queries and the second dimension is the number of images. A higher score indicates more similarity between the query and image (see the sketch after this list for ranking pages with these scores).
- Unlike ColPali, ColQwen2 supports arbitrary image resolutions and aspect ratios, which means images are not resized into fixed-size squares. This preserves more of the original input signal.
- Larger input images generate longer multi-vector embeddings, allowing users to adjust image resolution to balance performance and memory usage.
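
Building on the quickstart snippets above, here is a small, illustrative sketch of ranking pages per query with that score matrix. It assumes the `scores` and `queries` variables from the earlier snippets are in scope; the value of `k` is an arbitrary choice.

```python
# Rank the page images for each query using the (num_queries, num_images) score matrix
k = 1  # number of pages to keep per query (arbitrary)
top_scores, top_indices = scores.topk(k, dim=1)

for query, page_indices, page_scores in zip(queries, top_indices.tolist(), top_scores.tolist()):
    print(f"{query!r} -> page index {page_indices[0]} (score {page_scores[0]:.2f})")
```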

## ColQwen2Config

[[autodoc]] ColQwen2Config

## ColQwen2Processor

[[autodoc]] ColQwen2Processor

## ColQwen2ForRetrieval

[[autodoc]] ColQwen2ForRetrieval
    - forward