Merged

Update #1578

45 commits
f1fe91f
[MIEB] Adding DataComp CLIP models (#1283)
isaac-chung Oct 8, 2024
b0bc4e2
[mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-be…
gowitheflow-1998 Oct 11, 2024
1b70f6d
[mieb] adding 10 tasks (#1290)
gowitheflow-1998 Oct 15, 2024
6e7dd3d
[mieb] Adding MOCOv3 models (#1293)
isaac-chung Oct 18, 2024
053b5be
[mieb] Add more Any2AnyRetrieval datasets (#1285)
Jamie-Stirling Oct 20, 2024
a3ec14d
[mieb] Add any2any multiple choice evaluator and abstask (and one tas…
Jamie-Stirling Oct 20, 2024
b73a133
[mieb] Fix FORB dataset (#1306)
isaac-chung Oct 22, 2024
a6f306f
[mieb] run tasks fix (#1302)
gowitheflow-1998 Oct 22, 2024
22751ca
[mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, me…
Jamie-Stirling Oct 22, 2024
8065568
[mieb] run tasks small fix (#1310)
gowitheflow-1998 Oct 22, 2024
2011aa1
[mieb] Add VLM2vec (#1323)
isaac-chung Oct 25, 2024
93260cb
feat: Merge main into MIEB (#1329)
KennethEnevoldsen Oct 27, 2024
6979b2a
[mieb] Add OpenCLIP models (#1335)
isaac-chung Oct 28, 2024
45ffa44
[mieb] new version with downsampled train split to 32 per class (#1327)
isaac-chung Oct 28, 2024
8054607
[mieb] Fix Jina CLIP (#1349)
isaac-chung Oct 28, 2024
874c1bc
fix: Add clevr license (#1356)
KennethEnevoldsen Oct 29, 2024
cf8ea1f
Add BLINK as multi-choice tasks (#1348)
Jamie-Stirling Oct 29, 2024
6652e56
[mieb] add Eva CLIP models (#1369)
isaac-chung Oct 31, 2024
9b178e6
[mieb] add siglip, cohere multimodal & some fixes for final run (#1357)
gowitheflow-1998 Oct 31, 2024
4b0facc
[mieb] fixes for final run (#1374)
gowitheflow-1998 Nov 1, 2024
a449b24
Update run_vista.md
gowitheflow-1998 Nov 1, 2024
3a18fbd
[mieb] Fix torch no grad (#1378)
Muennighoff Nov 4, 2024
1ef93e4
[mieb] Fix vlm2vec (#1380)
isaac-chung Nov 5, 2024
34094ea
[mieb] Remove null entries from corpus of ROxford, RParis (#1371)
Jamie-Stirling Nov 5, 2024
2b56317
[mieb] fixes (#1390)
Muennighoff Nov 5, 2024
2862323
[MIEB] Remove non-existent method for blip (#1394)
imenelydiaker Nov 5, 2024
8a8b8b7
[mieb] fix ALIGN; update Winoground revision id; update run script (#…
gowitheflow-1998 Nov 6, 2024
01b7f28
[mieb] Fix open clip for cv bench count (#1397)
isaac-chung Nov 7, 2024
cdb92c6
[mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoi…
Jamie-Stirling Nov 7, 2024
a06227e
[mieb] Fix EVA CLIP for CV Bench (#1414)
isaac-chung Nov 10, 2024
f757892
[mieb] Add calculate probs for vlm2vec (#1418)
isaac-chung Nov 10, 2024
f60465a
[mieb] Fix siglip bug & add retrieval datasets (#1424)
gowitheflow-1998 Nov 10, 2024
f0dd6f6
[mieb] use Logistic Regression classifier for AbsTaskImageMultilabelC…
isaac-chung Nov 10, 2024
66176a0
[mieb] mieb scripts (siglip rerun & linear probing ablation & params …
gowitheflow-1998 Nov 10, 2024
7e0779a
[MIEB] Change Flickr30k to test split (#1449)
Jamie-Stirling Nov 15, 2024
1429cce
[mieb] Fix VLM2vec dtype (#1462)
isaac-chung Nov 18, 2024
2fc19e7
[mieb] run script for missing results (#1472)
gowitheflow-1998 Nov 18, 2024
fab0b82
[mieb] Fix Moco model on CIFAR10Clustering (#1487)
isaac-chung Nov 22, 2024
67a035d
[mieb] Fix Flickr30k I2T and T2I (#1505)
isaac-chung Nov 27, 2024
ff34ff6
[MIEB] add missing siglip models (#1533)
SaitejaUtpala Nov 30, 2024
dc35ce3
fix typo (#1535)
SaitejaUtpala Nov 30, 2024
c77b923
[mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO…
izhx Dec 4, 2024
d45fbb2
Add Voyage's multimodal embedding (#1555)
gowitheflow-1998 Dec 5, 2024
5f0b9c0
[mieb] update script for final re-run (#1576)
gowitheflow-1998 Dec 10, 2024
d2bb0ac
fix: no longer using same query text for all of BLINKIT2TMultiChoice …
Jamie-Stirling Dec 10, 2024
4 changes: 2 additions & 2 deletions .github/pull_request_template.md
@@ -34,6 +34,6 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl

- [ ] I have filled out the ModelMeta object to the extent possible
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision_id)` and
- [ ] `mteb.get_model_meta(model_name, revision_id)`
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
2 changes: 1 addition & 1 deletion .github/workflows/lint.yml
@@ -15,7 +15,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/mmteb.yml
@@ -16,7 +16,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
@@ -38,7 +38,7 @@ jobs:

- uses: actions/setup-python@v4
with:
python-version: "3.8"
python-version: "3.9"
cache: "pip"

- name: Install dependencies
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,11 +16,11 @@ jobs:
fail-fast: false
matrix:
os: [ubuntu-latest] #, macos-latest, windows-latest]
python-version: ["3.8", "3.9", "3.10"]
python-version: ["3.9", "3.10", "3.11", "3.12"]
include:
# Add Windows with Python 3.8 only to avoid tests taking too long
- os: windows-latest
python-version: "3.8"
python-version: "3.9"

steps:
- uses: actions/checkout@v3
155 changes: 106 additions & 49 deletions README.md
@@ -18,7 +18,7 @@
<h4 align="center">
<p>
<a href="#installation">Installation</a> |
<a href="#usage">Usage</a> |
<a href="#usage-documentation">Usage</a> |
<a href="https://huggingface.co/spaces/mteb/leaderboard">Leaderboard</a> |
<a href="#documentation">Documentation</a> |
<a href="#citing">Citing</a>
@@ -36,9 +36,9 @@
pip install mteb
```

## Usage
## Example Usage

* Using a python script (see [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) and [mteb/mtebscripts](https://github.com/embeddings-benchmark/mtebscripts) for more):
* Using a Python script:

```python
import mteb
@@ -55,6 +55,37 @@ evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results/{model_name}")
```

<details>
<summary> Running a SentenceTransformer model with prompts </summary>

Prompts can be passed to the SentenceTransformer model using the `prompts` parameter. The following code shows how to use prompts with SentenceTransformer:

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos", prompts={"query": "Query:", "passage": "Passage:"})
evaluation = mteb.MTEB(tasks=tasks)
```

In `prompts`, the key can be:
1. A prompt type (`passage`, `query`) - these prompts will be used in reranking and retrieval tasks
2. A task type - these prompts will be used in all tasks of the given type
    1. `BitextMining`
    2. `Classification`
    3. `MultilabelClassification`
    4. `Clustering`
    5. `PairClassification`
    6. `Reranking`
    7. `Retrieval`
    8. `STS`
    9. `Summarization`
    10. `InstructionRetrieval`
3. A pair of task type and prompt type, like `Retrieval-query` - these prompts will be used for queries in all retrieval tasks
4. A task name - these prompts will be used in the specific task
5. A pair of task name and prompt type, like `NFCorpus-query` - these prompts will be used for queries in the specific task
</details>
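For illustration, the key-matching behavior described above can be sketched as a small lookup helper. The precedence used below (most specific key first) and the helper itself are assumptions for the sketch, not `mteb`'s actual implementation:

```python
from __future__ import annotations


def resolve_prompt(prompts: dict[str, str], task_name: str, task_type: str,
                   prompt_type: str | None) -> str | None:
    """Pick the most specific matching prompt for a task.

    Assumed precedence (illustrative only): task name + prompt type,
    task name, task type + prompt type, task type, prompt type.
    """
    candidates = []
    if prompt_type:
        candidates.append(f"{task_name}-{prompt_type}")
    candidates.append(task_name)
    if prompt_type:
        candidates.append(f"{task_type}-{prompt_type}")
    candidates.append(task_type)
    if prompt_type:
        candidates.append(prompt_type)
    for key in candidates:
        if key in prompts:
            return prompts[key]
    return None


# Hypothetical prompt strings, for demonstration only
prompts = {
    "query": "Query:",
    "Retrieval-query": "Find documents for:",
    "NFCorpus-query": "Find medical documents for:",
}
resolved = resolve_prompt(prompts, "NFCorpus", "Retrieval", "query")
```

A task-specific key such as `NFCorpus-query` thus shadows the more generic `Retrieval-query` and `query` entries.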

* Using CLI

```bash
@@ -71,17 +102,17 @@ mteb run -m sentence-transformers/all-MiniLM-L6-v2 \



## Advanced Usage
## Usage Documentation
Click on each section below to see the details.

<br />

<details>
<summary> Dataset selection </summary>
<summary> Task selection </summary>

### Dataset selection
### Task selection

Datasets can be selected by providing the list of datasets, but also
Tasks can be selected by providing a list of task names, but also

* by their task (e.g. "Clustering" or "Classification")

@@ -121,11 +152,33 @@ evaluation = mteb.MTEB(tasks=[
# for an example of a HF subset see "Subset" in the dataset viewer at: https://huggingface.co/datasets/mteb/bucc-bitext-mining
```

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":
</details>

<details>
<summary> Running a benchmark </summary>

### Running a Benchmark

`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
For instance, to select the 56 English datasets that form the "Overall MTEB English leaderboard":

```python
import mteb
benchmark = mteb.get_benchmark("MTEB(eng)")
evaluation = mteb.MTEB(tasks=benchmark)
```

The benchmark specifies not only a list of tasks, but also which splits and languages to run on. To get an overview of all available benchmarks, simply run:

```python
from mteb import MTEB_MAIN_EN
evaluation = mteb.MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
import mteb
benchmarks = mteb.get_benchmarks()
```

Generally, we use the naming scheme `MTEB(*)` for benchmarks, where "*" denotes the target of the benchmark. In the case of a language, we use the three-letter language code. For large groups of languages, we use the group notation, e.g., `MTEB(Scandinavian)` for Scandinavian languages. External benchmarks implemented in MTEB, like `CoIR`, use their original name. When using a benchmark from MTEB, please cite `mteb` along with the citation of the benchmark, which you can access using:

```python
benchmark.citation
```

</details>
@@ -139,7 +192,7 @@ evaluation = mteb.MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
To pass in arguments to the model's `encode` function, you can use the encode keyword arguments (`encode_kwargs`):

```python
evaluation.run(model, encode_kwargs={"batch_size": 32}
evaluation.run(model, encode_kwargs={"batch_size": 32})
```
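As a sketch of the mechanics (an assumption about the call chain, not `mteb`'s internals), `encode_kwargs` is simply forwarded as keyword arguments to the model's `encode`:

```python
class EchoModel:
    """Toy model that records the kwargs its encode() receives."""

    def __init__(self):
        self.seen_kwargs = None

    def encode(self, sentences, **kwargs):
        self.seen_kwargs = kwargs
        return [[0.0] for _ in sentences]


def run(model, sentences, encode_kwargs=None):
    # Illustrative stand-in for evaluation.run: forward encode_kwargs verbatim
    return model.encode(sentences, **(encode_kwargs or {}))


m = EchoModel()
run(m, ["hello"], encode_kwargs={"batch_size": 32})
```

So anything placed in `encode_kwargs` should be a keyword your model's `encode` actually accepts.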
</details>

@@ -167,55 +220,35 @@ Note that the public leaderboard uses the test splits for all datasets except MS
Models should implement the following interface, implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.). For inspiration, you can look at the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts) used for running diverse models via SLURM scripts for the paper.

```python
class MyModel():
from mteb.encoder_interface import PromptType

class CustomModel:
def encode(
self, sentences: list[str], **kwargs: Any
) -> torch.Tensor | np.ndarray:
self,
sentences: list[str],
task_name: str,
prompt_type: PromptType | None = None,
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.

Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.

Returns:
The encoded sentences.
"""
pass

model = MyModel()
model = CustomModel()
tasks = mteb.get_task("Banking77Classification")
evaluation = MTEB(tasks=tasks)
evaluation.run(model)
```
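To smoke-test the interface without a real embedding model, a hypothetical toy encoder following the same signature might look like this (the character-frequency embedding is purely illustrative):

```python
import numpy as np


class ToyCharModel:
    """Hypothetical stand-in for CustomModel: embeds each sentence as a
    deterministic 26-dim vector of lowercase letter frequencies."""

    def encode(self, sentences, task_name=None, prompt_type=None, **kwargs):
        vectors = np.zeros((len(sentences), 26), dtype=np.float32)
        for i, sent in enumerate(sentences):
            for ch in sent.lower():
                if "a" <= ch <= "z":
                    vectors[i, ord(ch) - ord("a")] += 1.0
        # L2-normalize non-empty rows so dot products act like cosine similarity
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.maximum(norms, 1e-9)


model = ToyCharModel()
emb = model.encode(["hello world", "banana"], task_name="Banking77Classification")
```

Such a stub is only useful for checking that the evaluation pipeline calls your `encode` with the expected arguments; its scores are meaningless.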

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for `encode_queries` and `encode_corpus`. If these methods exist, they will be automatically used for those tasks. You can refer to the `DRESModel` at `mteb/evaluation/evaluators/RetrievalEvaluator.py` for an example of these functions.

```python
class MyModel():
def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
queries: List of sentences to encode

Returns:
List of embeddings for the given sentences
"""
pass

def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
"""
Returns a list of embeddings for the given sentences.
Args:
corpus: List of sentences to encode
or list of dictionaries with keys "title" and "text"

Returns:
List of embeddings for the given sentences
"""
pass
```

</details>

<details>
@@ -297,7 +330,7 @@ from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

tasks = mteb.get_tasks( tasks=["NFCorpus"], languages=["eng"])
tasks = mteb.get_tasks(tasks=["NFCorpus"], languages=["eng"])

evaluation = MTEB(tasks=tasks)
evaluation.run(
@@ -309,7 +342,7 @@ evaluation.run(
```

CLI:
```
```bash
mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predictions
```

@@ -318,9 +351,11 @@ mteb run -t NFCorpus -m all-MiniLM-L6-v2 --output_folder results --save_predicti
<details>
<summary> Fetching result from the results repository </summary>

Multiple models have already been run on tasks avaiable within MTEB. These results are available results [repository](https://github.com/embeddings-benchmark/results).
### Fetching results from the results repository

To make the results more easily accecible we have designed custom functionality for retrieving from the repository. For instance, you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:
Multiple models have already been run on tasks available within MTEB. These results are available in the results [repository](https://github.com/embeddings-benchmark/results).

To make the results more easily accessible, we have designed custom functionality for retrieving from the repository. For instance, if you are selecting the best model for your French and English retrieval task on legal documents you could fetch the relevant tasks and create a dataframe of the results using the following code:

```python
import mteb
@@ -345,6 +380,26 @@ df = results_to_dataframe(results)

</details>

<details>
<summary> Caching Embeddings To Re-Use Them </summary>


### Caching Embeddings To Re-Use Them

There are times you may want to cache the embeddings so you can re-use them. This may be the case if you have multiple query sets for the same corpus (e.g. Wikipedia) or are doing some optimization over the queries (e.g. prompting, other experiments). You can set up a cache by using a simple wrapper, which will save the cache per task in the `cache_embeddings/{task_name}` folder:

```python
# define your task and model above as normal
...
# wrap the model with the cache wrapper
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model_with_cached_emb, ...)
```
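Conceptually, the wrapper memoizes embeddings per input text, so repeated inputs are encoded only once. A minimal in-memory sketch of that behavior (the real `CachedEmbeddingWrapper` persists to disk and is organized per task) could be:

```python
class InMemoryCacheWrapper:
    """Illustrative sketch: cache embeddings by text so each unique input
    is encoded at most once. Not the actual CachedEmbeddingWrapper."""

    def __init__(self, model):
        self._model = model
        self._cache = {}

    def encode(self, sentences, **kwargs):
        missing = [s for s in sentences if s not in self._cache]
        if missing:
            for s, vec in zip(missing, self._model.encode(missing, **kwargs)):
                self._cache[s] = vec
        return [self._cache[s] for s in sentences]


class CountingModel:
    """Toy model that counts how many sentences it actually encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, sentences, **kwargs):
        self.calls += len(sentences)
        return [[float(len(s))] for s in sentences]


base = CountingModel()
cached = InMemoryCacheWrapper(base)
cached.encode(["a", "bb"])
cached.encode(["a", "bb", "ccc"])  # only "ccc" reaches the underlying model
```

This is why the wrapped model, not the original one, must be passed to `evaluation.run`.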

</details>

<br />


@@ -354,6 +409,7 @@
| Documentation | |
| ------------------------------ | ---------------------- |
| 📋 [Tasks] | Overview of available tasks |
| 📐 [Benchmarks] | Overview of available benchmarks |
| 📈 [Leaderboard] | The interactive leaderboard of the benchmark |
| 🤖 [Adding a model] | Information related to how to submit a model to the leaderboard |
| 👩‍🔬 [Reproducible workflows] | Information related to how to reproduce and create reproducible workflows with MTEB |
@@ -363,6 +419,7 @@ df = results_to_dataframe(results)
| 🌐 [MMTEB] | An open-source effort to extend MTEB to cover a broad set of languages |  

[Tasks]: docs/tasks.md
[Benchmarks]: docs/benchmarks.md
[Contributing]: CONTRIBUTING.md
[Adding a model]: docs/adding_a_model.md
[Adding a dataset]: docs/adding_a_dataset.md
File renamed without changes.
4 changes: 2 additions & 2 deletions docs/adding_a_dataset.md
@@ -77,7 +77,7 @@ evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
```

> **Note:** for multilingual / crosslingual tasks, make sure your class also inherits from the `MultilingualTask` class like in [this](https://github.com/embeddings-benchmark/mteb-draft/blob/main/mteb/tasks/Classification/MTOPIntentClassification.py) example.
> **Note:** for multilingual / crosslingual tasks, make sure your class also inherits from the `MultilingualTask` class like in [this](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/tasks/Classification/multilingual/MTOPIntentClassification.py) example.



@@ -104,7 +104,7 @@ class VGClustering(AbsTaskClustering):
form="Written",
domains=["Academic", "Non-fiction"],
task_subtypes=["Scientific Reranking"],
license="cc-by-nc",
license="cc-by-nc-4.0",
annotations_creators="derived",
dialect=[],
text_creation="found",
29 changes: 21 additions & 8 deletions docs/adding_a_model.md
@@ -29,10 +29,7 @@ mteb run -m {model_name} -t {task_names}

These will save the results in a folder called `results/{model_name}/{model_revision}`.

For reference you can also look at [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) for all MTEB English datasets used in the main ranking, or [scripts/run_mteb_chinese.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_chinese.py) for the Chinese ones.
Advanced scripts with different models are available in the [mteb/mtebscripts repo](https://github.com/embeddings-benchmark/mtebscripts).

2. **Format the results using the CLI:**
1. **Format the results using the CLI:**

```bash
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md
@@ -44,12 +41,28 @@ If readme of model exists:
mteb create_meta --results_folder results/{model_name}/{model_revision} --output_path model_card.md --from_existing your_existing_readme.md
```

3. **Add the frontmatter to model repository:**
2. **Add the frontmatter to model repository:**

Copy the content of the `model_card.md` file to the top of a `README.md` file of your model on the Hub. See [here](https://huggingface.co/Muennighoff/SGPT-5.8B-weightedmean-msmarco-specb-bitfit/blob/main/README.md) for an example.

4. **Wait for a refresh the leaderboard:**
3. **Wait for the leaderboard to refresh:**

The leaderboard [automatically refreshes daily](https://github.com/embeddings-benchmark/leaderboard/commits/main/) so once submitted you only need to wait for the automatic refresh. You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).

**Notes:**
- We remove models with scores that cannot be reproduced, so please ensure that your model is accessible and scores can be reproduced.
- An alternative way of submitting to the leaderboard is by opening a PR with your results [here](https://github.com/embeddings-benchmark/results) and checking that they are displayed correctly by [locally running the leaderboard](https://github.com/embeddings-benchmark/leaderboard?tab=readme-ov-file#developer-setup).

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses the prompt named `query` for encoding the queries and `passage` as the prompt name for encoding the corpus. This is aligned with the default names used by Sentence Transformers.
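As a sketch, this convention amounts to selecting a prompt by its prompt type and prepending it to the text before encoding. The helper and prompt strings below are illustrative assumptions, not `mteb`'s actual code:

```python
# Hypothetical prompt strings mimicking the query/passage convention above
PROMPTS = {
    "query": "Represent this query: ",
    "passage": "Represent this passage: ",
}


def apply_prompt(text, prompt_type):
    # prompt_type is "query" for queries and "passage" for corpus documents;
    # unknown or missing types leave the text unchanged
    return PROMPTS.get(prompt_type, "") + text


q = apply_prompt("treatment for migraines", "query")
d = apply_prompt("Migraines are commonly treated with triptans.", "passage")
```

The important part is the naming: as long as your model's prompt dictionary uses the keys `query` and `passage`, `mteb` can pick the right one per input.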

###### Adding the prompts in the model configuration (Preferred)

You can directly add the prompts when saving and uploading your model to the Hub. For an example, refer to this [configuration file](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v1.5/blob/3b5a16eaf17e47bd997da998988dce5877a57092/config_sentence_transformers.json).

The leaderboard will then automatically refresh daily so once submitted all you have to do is wait for the automatic refresh.
###### Instantiating the Model with Prompts

You can find the workflows for the leaderboard refresh [here](https://github.com/embeddings-benchmark/leaderboard/tree/main/.github/workflows). If you experience issues with the leaderboard please create an [issue](https://github.com/embeddings-benchmark/mteb/issues).
If you are unable to directly add the prompts in the model configuration, you can instantiate the model using the `sentence_transformers_loader` and pass `prompts` as an argument. For more details, see the `mteb/models/bge_models.py` file.