8 changes: 4 additions & 4 deletions .github/pull_request_template.md
@@ -29,12 +29,12 @@
- [ ] I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`
- [ ] I have filled out the metadata object in the dataset file (find documentation on it [here](https://github.com/embeddings-benchmark/mteb/blob/main/docs/adding_a_dataset.md#2-creating-the-metadata-object)).
- [ ] Run tests locally to make sure nothing is broken using `make test`.
- [ ] Run the formatter to format the code using `make lint`.
- [ ] Run tests locally to make sure nothing is broken using `make test`.
- [ ] Run the formatter to format the code using `make lint`.


### Adding a model checklist
<!--
<!--
When adding a model to the model registry
see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducible_workflow.md
-->
@@ -43,4 +43,4 @@ see also https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducibl
- [ ] I have ensured that my model can be loaded using
- [ ] `mteb.get_model(model_name, revision)` and
- [ ] `mteb.get_model_meta(model_name, revision)`
- [ ] I have tested the implementation works on a representative set of tasks.
- [ ] I have tested the implementation works on a representative set of tasks.
3 changes: 1 addition & 2 deletions .github/workflows/docs.yml
@@ -47,7 +47,7 @@ jobs:
- name: Create table
run: |
make build-docs

- name: Push table
run: |
git config --global user.email "github-actions[bot]@users.noreply.github.com"
@@ -60,4 +60,3 @@ jobs:
git commit -m "Update tasks table"
git push
fi

4 changes: 2 additions & 2 deletions .github/workflows/leaderboard_refresh.yaml
@@ -2,8 +2,8 @@ name: Daily Space Rebuild
on:
schedule:
# Runs at midnight Pacific Time (8 AM UTC)
- cron: '0 8 * * *'
workflow_dispatch: # Allows manual triggering
- cron: "0 8 * * *"
workflow_dispatch: # Allows manual triggering

jobs:
rebuild:
1 change: 0 additions & 1 deletion .github/workflows/lint.yml
@@ -25,4 +25,3 @@ jobs:
id: lint
run: |
make lint-check

22 changes: 11 additions & 11 deletions .github/workflows/model_loading.yml
@@ -3,22 +3,22 @@ name: Model Loading
on:
pull_request:
paths:
- 'mteb/models/**.py'
- "mteb/models/**.py"

jobs:
extract-and-run:
runs-on: ubuntu-latest

steps:
- name: Checkout repository
uses: actions/checkout@v3
- name: Checkout repository
uses: actions/checkout@v3

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
cache: 'pip'
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.10"
cache: "pip"

- name: Install dependencies and run tests
run: |
make model-load-test BASE_BRANCH=${{ github.event.pull_request.base.ref }}
- name: Install dependencies and run tests
run: |
make model-load-test BASE_BRANCH=${{ github.event.pull_request.base.ref }}
7 changes: 3 additions & 4 deletions .github/workflows/release.yml
@@ -20,8 +20,7 @@ jobs:
runs-on: ubuntu-latest
concurrency: release
permissions:
id-token: write # IMPORTANT: this permission is mandatory for trusted publishing using PyPI

id-token: write # IMPORTANT: this permission is mandatory for trusted publishing using PyPI

if: ${{ github.ref == 'refs/heads/main' && github.event.workflow_run.conclusion == 'success'}}
steps:
@@ -40,8 +39,8 @@ jobs:
- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
if: steps.release.outputs.released == 'true'
# This action supports PyPI's trusted publishing implementation, which allows authentication to PyPI without a manually
# configured API token or username/password combination. To perform trusted publishing with this action, your project's
# This action supports PyPI's trusted publishing implementation, which allows authentication to PyPI without a manually
# configured API token or username/password combination. To perform trusted publishing with this action, your project's
# publisher must already be configured on PyPI.

- name: Publish package distributions to GitHub Releases
4 changes: 1 addition & 3 deletions .github/workflows/test.yml
@@ -2,7 +2,6 @@
# 1) install Python dependencies
# 2) run make test


name: Test
on:
push:
@@ -30,7 +29,7 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
cache: "pip"

- name: Install dependencies
shell: bash
run: |
@@ -53,4 +52,3 @@ jobs:
# if it fails again, the workflow will fail.
# If it passes the first time the test will not run again
make test || make test

2 changes: 1 addition & 1 deletion .gitignore
@@ -151,4 +151,4 @@ model_names.txt
mteb/leaderboard/__cached_results.json

# gradio
.gradio/
.gradio/
31 changes: 31 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,31 @@
fail_fast: true

repos:
- repo: https://github.com/abravalheri/validate-pyproject
rev: v0.23
hooks:
- id: validate-pyproject

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v2.3.0
hooks:
- id: check-yaml
- id: check-json
- id: pretty-format-json
args:
- "--autofix"
- "--indent=4"
- "--no-sort-keys"
- id: end-of-file-fixer # generated a lot of changes
- id: trailing-whitespace
- id: check-toml

- repo: local
hooks:
- id: lint
name: lint
description: "Run 'make lint'"
entry: make lint
language: python
types_or: [python]
minimum_pre_commit_version: "2.9.2"
2 changes: 1 addition & 1 deletion .vscode/extensions.json
@@ -2,4 +2,4 @@
"recommendations": [
"charliermarsh.ruff"
]
}
}
2 changes: 1 addition & 1 deletion .vscode/settings.json
@@ -4,5 +4,5 @@
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true,
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.defaultFormatter": "charliermarsh.ruff"
}
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -1,5 +1,5 @@
## Contributing to MTEB
We welcome contributions such as new datasets to MTEB! Please see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.
We welcome contributions such as new datasets to MTEB! Please see the related [issue](https://github.com/embeddings-benchmark/mteb/issues/360) for more information.

Once you have decided on your contribution, this document describes how to set up the repository for development.

@@ -41,4 +41,4 @@ MTEB follows [semantic versioning](https://semver.org/). This means that the ver

Any commit with one of these prefixes will trigger a version bump upon merging to the main branch as long as tests pass. A version bump will then trigger a new release on PyPI as well as a new release on GitHub.

Other prefixes, for example `docs:`, `chore:`, and `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you're not required to follow this convention when contributing to MTEB.
Other prefixes, for example `docs:`, `chore:`, and `refactor:`, will not trigger a version bump; however, they will still structure the commit history and the changelog. You can find more information about this in the [python-semantic-release documentation](https://python-semantic-release.readthedocs.io/en/latest/). If you do not intend to trigger a version bump, you're not required to follow this convention when contributing to MTEB.
14 changes: 11 additions & 3 deletions Makefile
@@ -1,6 +1,7 @@
install:
@echo "--- 🚀 Installing project dependencies ---"
pip install -e ".[dev]"
pre-commit install

install-for-tests:
@echo "--- 🚀 Installing project dependencies for test ---"
@@ -10,7 +11,7 @@ install-for-tests:
lint:
@echo "--- 🧹 Running linters ---"
ruff format . # running ruff formatting
ruff check . --fix # running ruff linting
ruff check . --fix --exit-non-zero-on-fix # running ruff linting # --exit-non-zero-on-fix is used for the pre-commit hook to work

lint-check:
@echo "--- 🧹 Check if project is linted ---"
@@ -22,9 +23,10 @@ test:
@echo "--- 🧪 Running tests ---"
pytest -n auto -m "not test_datasets"


test-with-coverage:
@echo "--- 🧪 Running tests with coverage ---"
pytest -n auto --cov-report=term-missing --cov-config=pyproject.toml --cov=mteb
pytest -n auto --cov-report=term-missing --cov-config=pyproject.toml --cov=mteb

pr:
@echo "--- 🚀 Running requirements for a PR ---"
@@ -52,4 +54,10 @@ dataset-load-test:

run-leaderboard:
@echo "--- 🚀 Running leaderboard locally ---"
python -m mteb.leaderboard.app
python -m mteb.leaderboard.app


.PHONY: check
check: ## Run code quality tools.
@echo "--- 🧹 Running code quality tools ---"
@pre-commit run -a
22 changes: 11 additions & 11 deletions README.md
@@ -67,7 +67,7 @@ evaluation = mteb.MTEB(tasks=tasks)
```

In prompts the key can be:
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
1. Prompt types (`passage`, `query`) - they will be used in reranking and retrieval tasks
2. Task type - these prompts will be used in all tasks of the given type
1. `BitextMining`
2. `Classification`
@@ -103,7 +103,7 @@ mteb run -m sentence-transformers/all-MiniLM-L6-v2 \
## Usage Documentation
Click on each section below to see the details.

<br />
<br />

<details>
<summary> Task selection </summary>
@@ -159,7 +159,7 @@ tasks = mteb.get_tasks(modalities=["text", "image"]) # Only select tasks with te
You can also specify exclusive modality filtering to only get tasks with exactly the requested modalities (default behavior with exclusive_modality_filter=False):
```python
# Get tasks with text modality, this will also include tasks having both text and image modalities
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=False)
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=False)

# Get tasks that have ONLY text modality (no image or other modalities)
tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=True)
@@ -172,7 +172,7 @@ tasks = mteb.get_tasks(modalities=["text"], exclusive_modality_filter=True)

### Running a Benchmark

`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
`mteb` comes with a set of predefined benchmarks. These can be fetched using `get_benchmark` and run in a similar fashion to other sets of tasks.
For instance, to select the 56 English datasets that form the "Overall MTEB English leaderboard":

```python
@@ -262,13 +262,13 @@ class CustomModel:
**kwargs,
) -> np.ndarray:
"""Encodes the given sentences using the encoder.

Args:
sentences: The sentences to encode.
task_name: The name of the task.
prompt_type: The prompt type to use.
**kwargs: Additional arguments to pass to the encoder.

Returns:
The encoded sentences.
"""
@@ -312,7 +312,7 @@ evaluation.run(model)

### Using a cross encoder for reranking

To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.
To use a cross encoder for reranking, you can directly use a CrossEncoder from SentenceTransformers. The following code shows a two-stage run with the second stage reading results saved from the first stage.

```python
from mteb import MTEB
@@ -468,7 +468,7 @@ model_w_contamination = ModelMeta(
### Running the Leaderboard

It is possible to completely deploy the leaderboard locally or self-host it. This can, for example, be relevant for companies that might want to
build their own benchmarks or integrate custom tasks into existing benchmarks.
build their own benchmarks or integrate custom tasks into existing benchmarks.

Running the leaderboard is quite easy. Simply run:
```py
@@ -494,12 +494,12 @@ There are times you may want to cache the embeddings so you can re-use them. Thi
from mteb.models.cache_wrapper import CachedEmbeddingWrapper
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path='path_to_cache_dir')
# run as normal
evaluation.run(model, ...)
evaluation.run(model, ...)
```
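
For context, a slightly fuller sketch of how the wrapper can slot into a run is shown below. All of the calls appear elsewhere in this repository's documentation; the model name, task name, and cache directory are illustrative placeholders.

```python
import mteb
from mteb.models.cache_wrapper import CachedEmbeddingWrapper

# Illustrative model and task; substitute your own.
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
model_with_cached_emb = CachedEmbeddingWrapper(model, cache_path="path_to_cache_dir")

tasks = mteb.get_tasks(tasks=["Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Re-running with the same cache_path reuses the stored embeddings instead of re-encoding.
evaluation.run(model_with_cached_emb)
```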

</details>

<br />
<br />



@@ -540,7 +540,7 @@ MTEB was introduced in "[MTEB: Massive Text Embedding Benchmark](https://arxiv.o
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
}
```
2 changes: 1 addition & 1 deletion docs/adding_a_benchmark.md
@@ -4,4 +4,4 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead

1. Add your benchmark to [benchmark.py](../mteb/benchmarks/benchmarks.py) as a `Benchmark` object, and select the MTEB tasks that will be in the benchmark (a rough sketch of such an entry is shown after this list). If some of the tasks do not exist in MTEB, follow the "add a dataset" instructions to add them.
2. Open a PR at https://github.com/embedding-benchmark/results with results of models on your benchmark.
3. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
3. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
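
As a rough illustration of the first step, a new entry in `benchmarks.py` might look like the sketch below. The field names mirror existing `Benchmark` entries but should be treated as assumptions; check the dataclass definition in that file for the exact signature. The import path, benchmark name, and task list are placeholders.

```python
import mteb
from mteb.benchmarks.benchmarks import Benchmark  # assumed import path, matching the file linked above

# Hypothetical benchmark entry; name, tasks, and metadata are placeholders.
MY_CUSTOM_BENCHMARK = Benchmark(
    name="MyBenchmark(custom)",
    tasks=mteb.get_tasks(tasks=["NFCorpus", "Banking77Classification"]),
    description="A small example benchmark with one retrieval and one classification task.",
    reference="https://example.com/my-benchmark",
)
```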
12 changes: 6 additions & 6 deletions docs/adding_a_model.md
@@ -5,7 +5,7 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
1. **Add meta information about your model to [model dir](../mteb/models/)**. See the docstring of ModelMeta for meta data details.
```python
from mteb.model_meta import ModelMeta

bge_m3 = ModelMeta(
name="model_name",
languages=["model_languages"], # in format eng-Latn
@@ -31,12 +31,12 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
from mteb.models.wrapper import Wrapper
from mteb.encoder_interface import PromptType
import numpy as np

class CustomWrapper(Wrapper):
def __init__(self, model_name, model_revision):
super().__init__(model_name, model_revision)
# your custom implementation here

def encode(
self,
sentences: list[str],
@@ -52,7 +52,7 @@ The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/lead
```python
your_model = ModelMeta(
loader=partial(
CustomWrapper,
CustomWrapper,
model_name="model_name",
model_revision="5617a9f61b028005a4858fdac845db406aefb181"
),
@@ -70,7 +70,7 @@ import mteb
model = mteb.get_model("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

tasks = mteb.get_tasks(...) # get specific tasks
# or
# or
tasks = mteb.get_benchmark("MTEB(eng, classic)") # or use a specific benchmark

evaluation = mteb.MTEB(tasks=tasks)
@@ -95,7 +95,7 @@ To add results to the public leaderboard you can push your results to the [resul

##### Using Prompts with Sentence Transformers

If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).
If your model uses Sentence Transformers and requires different prompts for encoding the queries and corpus, you can take advantage of the `prompts` [parameter](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer).

Internally, `mteb` uses `query` for encoding the queries and `passage` as the prompt names for encoding the corpus. This is aligned with the default names used by Sentence Transformers.
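
As a rough sketch of this setup, the snippet below passes query and passage prompts directly to a Sentence Transformers model before handing it to `mteb`. The model name, task, and prompt strings are placeholders rather than recommended values.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder prompts keyed by the names mteb looks up when encoding queries and corpus documents.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
)

tasks = mteb.get_tasks(tasks=["NFCorpus"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model)
```

Because the prompt names match the defaults described above, no extra wiring is needed for `mteb` to pick the right prompt for queries versus corpus documents.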

2 changes: 1 addition & 1 deletion docs/benchmarks.md
@@ -30,4 +30,4 @@ The following table gives you an overview of the benchmarks in MTEB.
| [MTEB(rus)](https://aclanthology.org/2023.eacl-main.148/) | 23 | {'Classification': 9, 'Clustering': 3, 'MultilabelClassification': 2, 'PairClassification': 1, 'Reranking': 2, 'Retrieval': 3, 'STS': 3} | [Web, Social, Academic, Written, Blog, News, Spoken, Reviews, Encyclopaedic] | rus |
| [NanoBEIR](https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6) | 13 | {'Retrieval': 13} | [Web, Academic, Social, Medical, Written, Non-fiction, News, Encyclopaedic] | eng |
| [RAR-b](https://arxiv.org/abs/2404.06347) | 17 | {'Retrieval': 17} | [Encyclopaedic, Written, Programming] | eng |
<!-- BENCHMARKS TABLE END -->
<!-- BENCHMARKS TABLE END -->