Merged
16 changes: 9 additions & 7 deletions README.md
@@ -1,6 +1,5 @@
# ColPali Cookbooks 👨🏻‍🍳

[![GitHub](https://img.shields.io/badge/ColPali_Cookbooks-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/tonywu71/colpali-cookbooks)
[![arXiv](https://img.shields.io/badge/arXiv-2407.01449-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2407.01449)
[![Hugging Face](https://img.shields.io/badge/Vidore-FFD21E?style=for-the-badge&logo=huggingface&logoColor=000)](https://huggingface.co/vidore)
[![X](https://img.shields.io/badge/Thread-%23000000?style=for-the-badge&logo=X&logoColor=white)](https://x.com/tonywu_71/status/1809183824464560138)
@@ -10,19 +9,22 @@

## Introduction

With our new model *ColPali*, we propose to leverage VLMs to construct efficient multi-vector embeddings in the visual space for document retrieval. By feeding the ViT output patches from PaliGemma-3B to a linear projection, we create a multi-vector representation of documents. We train the model to maximize the similarity between these document embeddings and the query embeddings, following the ColBERT method.

Using ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, ...) of a document.
[ColPali](https://huggingface.co/papers/2407.01449) is a model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColPali treats each page as an image. It uses [PaliGemma-3B](./paligemma) to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.

This repository contains notebooks for learning about the ColVision family of models, fine-tuning them for your specific use case, creating similarity maps to interpret their predictions, and more! 😍
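
The late interaction scoring mentioned in the introduction can be illustrated with a minimal sketch. It assumes the ColBERT-style MaxSim rule: each query vector is matched to its most similar page vector by dot product, and the per-query maxima are summed. The function names and the toy 2-dimensional vectors are hypothetical and are not taken from the ColPali codebase; real embeddings come from the model and are scored with batched tensor operations.

```python
# Dependency-free sketch of ColBERT-style late interaction (MaxSim).
# All names and values below are illustrative assumptions, not the library's API.


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, page_vecs):
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def score_matrix(queries, pages):
    """Pairwise scores: one row per query, one column per page."""
    return [[late_interaction_score(q, p) for p in pages] for q in queries]


# Toy multi-vector embeddings: one vector per query token or page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.0, 1.0], [0.0, 0.5]]

scores = score_matrix([query], [page_a, page_b])
print(scores)  # [[2.0, 1.0]]
```

In this toy case `page_a` ranks first because it contains a good match for both query vectors, whereas `page_b` only matches the second one.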

## Table of Contents

You can find the cookbooks in the [`examples`](https://github.com/tonywu71/colpali-cookbooks/tree/main/examples) directory. In the table below, they are listed from most recent to oldest.

| Task | Notebook | Description |
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| Inference, interpretability | [Use the 🤗 transformers-native ColQwen2](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/use_transformers_native_colqwen2.ipynb) | Use the 🤗 transformers-native implementation of ColQwen2 for inference, scoring, and interpretability. |
| Inference, interpretability | [Use the 🤗 transformers-native ColPali](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/use_transformers_native_colpali.ipynb) | Use the 🤗 transformers-native implementation of ColPali for inference, scoring, and interpretability. |
| RAG | [ColQwen2: One model for your whole RAG pipeline with adapter hot-swapping 🔥](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/run_e2e_rag_colqwen2_with_adapter_hot_swapping.ipynb) | Save VRAM by using a unique VLM for your entire RAG pipeline. Works even on Colab's free T4 GPU! |
| Interpretability | [ColQwen2: Generate your own similarity maps 👀](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/gen_colqwen2_similarity_maps.ipynb) | Generate your own similarity maps to interpret ColQwen2's predictions. |
| Interpretability | [ColPali: Generate your own similarity maps 👀](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/gen_colpali_similarity_maps.ipynb) | Generate your own similarity maps to interpret ColPali's predictions. |
| Fine-tuning | [Fine-tune ColPali 🛠️](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/finetune_colpali.ipynb) | Fine-tune ColPali using LoRA and optional 4bit/8bit quantization. |
| Interpretability | [ColQwen2: Generate your own similarity maps 👀](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/gen_colqwen2_similarity_maps.ipynb) | Generate your own similarity maps to interpret ColQwen2's predictions. |
| RAG | [ColQwen2: One model for your whole RAG pipeline with adapter hot-swapping 🔥](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/run_e2e_rag_colqwen2_with_adapter_hot_swapping.ipynb) | Save VRAM by using a unique VLM for your entire RAG pipeline. Works even on Colab's free T4 GPU! |
| Inference, interpretability | [Use the 🤗 transformers-native ColPali](https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/use_transformers_native_colpali.ipynb) | Use the 🤗 transformers-native implementation of ColPali for inference, scoring, and interpretability. |

## Instructions

10 changes: 7 additions & 3 deletions examples/gen_colpali_similarity_maps.ipynb
@@ -607,7 +607,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -644,11 +644,15 @@
}
],
"source": [
"# Use this cell output to choose a token using its index\n",
"query_content = processor.decode(batch_queries.input_ids[0]).replace(processor.tokenizer.pad_token, \"\")\n",
"# Remove the padding tokens and the query augmentation tokens\n",
"query_content = processor.decode(batch_queries.input_ids[0])\n",
"query_content = query_content.replace(processor.tokenizer.pad_token, \"\")\n",
"query_content = query_content.replace(processor.query_augmentation_token, \"\").strip()\n",
"\n",
"# Retokenize the cleaned query\n",
"query_tokens = processor.tokenizer.tokenize(query_content)\n",
"\n",
"# Use this cell output to choose a token using its index\n",
"pprint.pprint({idx: val for idx, val in enumerate(query_tokens)})"
]
},
10 changes: 7 additions & 3 deletions examples/gen_colqwen2_similarity_maps.ipynb
@@ -529,7 +529,7 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -566,11 +566,15 @@
}
],
"source": [
"# Use this cell output to choose a token using its index\n",
"query_content = processor.decode(batch_queries.input_ids[0]).replace(processor.tokenizer.pad_token, \"\")\n",
"# Remove the padding tokens and the query augmentation tokens\n",
"query_content = processor.decode(batch_queries.input_ids[0])\n",
"query_content = query_content.replace(processor.tokenizer.pad_token, \"\")\n",
"query_content = query_content.replace(processor.query_augmentation_token, \"\").strip()\n",
"\n",
"# Retokenize the cleaned query\n",
"query_tokens = processor.tokenizer.tokenize(query_content)\n",
"\n",
"# Use this cell output to choose a token using its index\n",
"pprint.pprint({idx: val for idx, val in enumerate(query_tokens)})"
]
},
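
Both notebook cells apply the same cleanup: strip the padding tokens and the query augmentation tokens from the decoded query, retokenize what remains, and print an index-to-token mapping. The sketch below reproduces that flow without loading a model. The token strings `<pad>` and `<aug>`, the example query, and the whitespace split standing in for `processor.tokenizer.tokenize` are all assumptions made for illustration; the real values come from the processor.

```python
import pprint

# Hypothetical special tokens; in the notebooks these come from
# processor.tokenizer.pad_token and processor.query_augmentation_token.
PAD_TOKEN = "<pad>"
AUGMENTATION_TOKEN = "<aug>"


def clean_decoded_query(decoded, pad_token, augmentation_token):
    """Remove padding and query augmentation tokens, then trim whitespace."""
    cleaned = decoded.replace(pad_token, "")
    cleaned = cleaned.replace(augmentation_token, "")
    return cleaned.strip()


def index_tokens(tokens):
    """Map each token position to its token so one can be picked by index."""
    return {idx: tok for idx, tok in enumerate(tokens)}


# Stand-in for the output of processor.decode(batch_queries.input_ids[0]).
decoded = "<pad><pad>Which hour has the highest consumption?<aug><aug><aug>"

query_content = clean_decoded_query(decoded, PAD_TOKEN, AUGMENTATION_TOKEN)

# Whitespace split is only a stand-in for the real subword tokenizer.
query_tokens = query_content.split()

pprint.pprint(index_tokens(query_tokens))
```

Cleaning before retokenizing keeps the printed indices limited to tokens of the actual query text, rather than mixing in padding or augmentation tokens that carry no query content.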