Merged
Commits (48)
9e2e972
bump ruff (#2784)
Samoed Jun 8, 2025
af7adbf
Update issue and pr templates (#2782)
Samoed Jun 9, 2025
8817670
model: Add GeoGPT-Research-Project/GeoEmbedding (#2773)
Hypothesis-Z Jun 9, 2025
1c08974
model: add fangxq/XYZ-embedding (#2741)
fangxiaoquan Jun 9, 2025
3d8dd9e
ci: fix config error for semantic release (#2800)
KennethEnevoldsen Jun 9, 2025
b8e64e1
dataset: Add R2MED Benchmark (#2795)
KennethEnevoldsen Jun 9, 2025
5e6aa9d
Update tasks & benchmarks tables
github-actions[bot] Jun 9, 2025
36a3c67
Update training datasets of GeoGPT-Research-Project/GeoEmbedding (#2802)
Hypothesis-Z Jun 10, 2025
fef1837
fix: Add adapted_from to Cmedqaretrieval (#2806)
KennethEnevoldsen Jun 10, 2025
e6238f2
1.38.28
invalid-email-address Jun 10, 2025
873ee76
fix: Adding client arg to init method of OpenAI models wrapper (#2803)
malteos Jun 11, 2025
3e291f3
model: Add annamodels/LGAI-Embedding-Preview (#2810)
annamodels Jun 11, 2025
56dc620
fix: Ensure bright uses the correct revision (#2812)
KennethEnevoldsen Jun 11, 2025
9fc0c3d
1.38.29
invalid-email-address Jun 11, 2025
04c9511
add description to issue template (#2817)
Samoed Jun 15, 2025
03e084b
model: Added 3 HIT-TMG's KaLM-embedding models (#2478)
ayush1298 Jun 15, 2025
c790269
fix: Reuploaded previously unavailable SNL datasets (#2819)
KennethEnevoldsen Jun 16, 2025
74d17b2
Update tasks & benchmarks tables
github-actions[bot] Jun 16, 2025
dcdc16a
1.38.30
invalid-email-address Jun 16, 2025
774a942
docs: Fix some typos in `docs/usage/usage.md` (#2835)
sadra-barikbin Jun 19, 2025
d7ff1ab
model: Add custom instructions for GigaEmbeddings (#2836)
ekolodin Jun 20, 2025
8851bf0
model: add Seed-1.6-embedding model (#2841)
QuanYuhan Jun 25, 2025
9a800d3
fix: Update model selection for the leaderboard (#2855)
KennethEnevoldsen Jun 25, 2025
642898f
1.38.31
invalid-email-address Jun 25, 2025
a8214e2
fix: update training dataset info of Seed-1.6-embedding model (#2857)
QuanYuhan Jun 25, 2025
82844eb
1.38.32
invalid-email-address Jun 25, 2025
f1d560a
add jinav4 model meta (#2858)
makram93 Jun 27, 2025
430357c
fix: prompt validation for tasks with `-` (#2846)
Samoed Jun 27, 2025
9fed3e5
1.38.33
invalid-email-address Jun 27, 2025
e3286d5
model: Adding Sailesh97/Hinvec (#2842)
SaileshP97 Jun 28, 2025
a4388c2
Bump gradio to fix leaderboard sorting (#2866)
Samoed Jun 28, 2025
4ff1413
model: Adding nvidia/llama-nemoretriever-colembed models (#2861)
bschifferer Jun 29, 2025
f27648b
rename seed-1.6-embedding to seed1.6-embedding (#2870)
QuanYuhan Jul 1, 2025
f346a37
fix tests to be compatible with `SentenceTransformers` `v5` (#2875)
Samoed Jul 2, 2025
5846f56
model: add listconranker modelmeta (#2874)
tutuDoki Jul 3, 2025
b67bd04
model: add kalm_models ModelMeta (new PR) (#2853)
YanshekWoo Jul 3, 2025
a3ca95c
Comment kalm model (#2877)
Samoed Jul 4, 2025
70768b5
Add and fix some Japanese datasets: ANLP datasets, JaCWIR, JQaRA (#2872)
lsz05 Jul 4, 2025
5be02c1
Update tasks & benchmarks tables
github-actions[bot] Jul 4, 2025
04dc6d4
model: add Hakim and TookaSBERTV2 models (#2826)
mehran-sarmadi Jul 4, 2025
ee17a6e
dataset: Evalita dataset integration (#2859)
MattiaSangermano Jul 7, 2025
5303fec
Update tasks & benchmarks tables
github-actions[bot] Jul 7, 2025
00c95cf
fix: pin datasets version (#2892)
Samoed Jul 10, 2025
cfa27d7
1.38.34
invalid-email-address Jul 10, 2025
c8c0d32
Merge branch 'main' into merge_main_v2_07_10
Samoed Jul 10, 2025
0b6fcae
fix model implementations
Samoed Jul 10, 2025
141fca0
fix tasks
Samoed Jul 10, 2025
8285279
add metrics
Samoed Jul 10, 2025
76 changes: 45 additions & 31 deletions docs/tasks.md

Large diffs are not rendered by default.

25 changes: 12 additions & 13 deletions docs/usage/usage.md
@@ -1,7 +1,7 @@
# Usage

This usage documentation follows a structure similar first it introduces a simple example of how to evaluate a model in MTEB.
Then introduces model detailed section of defining model, selecting tasks and running the evaluation. Each section contain subsection pertaining to
Then introduces model detailed section of defining model, selecting tasks and running the evaluation. Each section contains subsections pertaining to
these.


@@ -28,10 +28,10 @@ For instance if we want to run [`"sentence-transformers/all-MiniLM-L6-v2"`](http
```python
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# or using SentenceTransformers
model = SentenceTransformers(model_name)
# load the model using MTEB
model = mteb.get_model(model_name) # will default to SentenceTransformers(model_name) if not implemented in MTEB
# or using SentenceTransformers
model = SentenceTransformers(model_name)

# select the desired tasks and evaluate
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
@@ -59,7 +59,7 @@ MTEB is not only text evaluating, but also allow you to evaluate image and image
> [!NOTE]
> Running MTEB on images requires you to install the optional dependencies using `pip install mteb[image]`

To evaluate image embeddings you can follows the same approach for any other task in `mteb`. Simply ensuring that the task contains the modality "image":
To evaluate image embeddings you can follow the same approach for any other task in `mteb`. Simply ensuring that the task contains the modality "image":

```python
tasks = mteb.get_tasks(modalities=["image"]) # Only select tasks with image modalities
@@ -107,7 +107,7 @@ model = meta.load_model()
model = mteb.get_model(model_name)
```

You can get an overview of on the models available in `mteb` as follows:
You can get an overview of the models available in `mteb` as follows:

```py
model_metas = mteb.get_model_metas()
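# Hedged follow-up sketch: ModelMeta entries expose basic fields such as `name`
# (attribute assumed from mteb's model registry); print a few as an overview.
print([meta.name for meta in model_metas[:5]])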
@@ -132,7 +132,7 @@ tasks = mteb.get_tasks(tasks=["Banking77Classification"])
results = mteb.evaluate(model, tasks=tasks)
```

However, we do recommend check in mteb include an implementation of the model before using sentence transformers since some models (e.g. the [multilingual e5 models](https://huggingface.co/collections/intfloat/multilingual-e5-text-embeddings-67b2b8bb9bff40dec9fb3534)) require a prompt and not specifying it may reduce performance.
However, we do recommend checking if mteb includes an implementation of the model before using sentence transformers since some models (e.g. the [multilingual e5 models](https://huggingface.co/collections/intfloat/multilingual-e5-text-embeddings-67b2b8bb9bff40dec9fb3534)) require a prompt and not specifying it may reduce performance.

> [!NOTE]
> If you want to evaluate a cross encoder on a reranking task, see section on [running cross encoders for reranking](#running-cross-encoders-on-reranking)
@@ -141,7 +141,7 @@ However, we do recommend check in mteb include an implementation of the model be

It is also possible to implement your own custom model in MTEB as long as it adheres to the [encoder interface](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/encoder_interface.py#L21).

This entails implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.).
This entails implementing an `encode` function taking as input a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.).

```python
import mteb
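# Hedged sketch of a custom model (illustrative names only): per the description
# above, it only needs an `encode` method mapping a list of sentences to one
# embedding per sentence; the exact keyword arguments mteb passes are assumed here.
import numpy as np


class MyDummyEncoder:
    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # Replace this random projection with a real model's forward pass.
        return np.random.default_rng(0).random((len(sentences), 384))


# tasks = mteb.get_tasks(tasks=["Banking77Classification"])
# results = mteb.evaluate(MyDummyEncoder(), tasks=tasks)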
@@ -181,7 +181,7 @@ If you want to submit your implementation to be included in the leaderboard see

## Selecting Tasks

This section describes how to select benchmarks and task to evaluate, including selecting specific subsets or splits to run.
This section describes how to select benchmarks and tasks to evaluate, including selecting specific subsets or splits to run.

### Selecting a Benchmark

@@ -197,7 +197,7 @@ results = mteb.evaluate(model, tasks=benchmark)

The benchmark specified not only a list of tasks, but also what splits and language to run on.

To get an overview of all available benchmarks simply run:
To get an overview of all available benchmarks, simply run:

```python
import mteb
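# Hedged sketch: enumerate the registered benchmarks and read their names
# (`get_benchmarks()` and the `name` attribute are assumed from mteb's public API).
benchmarks = mteb.get_benchmarks()
print([b.name for b in benchmarks][:5])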
@@ -218,7 +218,7 @@ benchmark.citation

### Task selection

`mteb` comes the utility function `mteb.get_task` and `mteb_get_tasks` for fetching and analysing the tasks of interest.
`mteb` comes with the utility function `mteb.get_task` and `mteb_get_tasks` for fetching and analysing the tasks of interest.

This can be done in multiple ways, e.g.:

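For example, a hedged sketch of a few common selection patterns (the filter argument shown for languages is an assumption based on `mteb.get_tasks`' documented filters):

```python
import mteb

# fetch a single task by name
task = mteb.get_task("Banking77Classification")

# fetch several tasks at once
tasks = mteb.get_tasks(tasks=["Banking77Classification", "NFCorpus"])

# filter by language code (argument name assumed)
eng_tasks = mteb.get_tasks(languages=["eng"])
```
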
@@ -296,7 +296,7 @@ results = mteb.evaluate(model, tasks=[MyCustomTask()])

## Running the Evaluation

This section contain documentation related to the runtime of the evalution. How to pass arguments to the encoder, saving outputs and similar.
This section contains documentation related to the runtime of the evaluation. How to pass arguments to the encoder, saving outputs and similar.


### Introduction to `mteb.evaluate()`
@@ -307,7 +307,6 @@ Evalauting models in `mteb` typically takes the simple form:
results = mteb.evaluate(model, tasks=tasks)
```


### Specifying the cache

By default `mteb` with save the results in cache folder located at `~/.cache/mteb`, however if you want to saving the results in a specific folder you
@@ -360,7 +359,7 @@ In prompts the key can be:
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
3. Pair of task type and prompt type like `Retrival-query` - these prompts will be used in all classification tasks
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all Retrieval tasks
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task

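As an illustration, a hedged sketch of a prompts mapping using the key forms above (the prompt strings are invented, and the exact argument or wrapper that consumes such a mapping depends on the model implementation):

```python
# Keys follow the forms listed above: task type, task type + prompt type,
# and task name + prompt type. The prompt texts are placeholders.
model_prompts = {
    "Classification": "Classify the following text: ",
    "Retrieval-query": "Represent this query for retrieving relevant documents: ",
    "NFCorpus-query": "Represent this medical question for retrieval: ",
}
```
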
2 changes: 1 addition & 1 deletion mteb/abstasks/AbsTaskReranking.py
@@ -10,7 +10,7 @@

logger = logging.getLogger(__name__)

OLD_FORMAT_RERANKING_TASKS = []
OLD_FORMAT_RERANKING_TASKS = ["JQaRAReranking", "JaCWIRReranking", "XGlueWPRReranking"]


class AbsTaskReranking(AbsTaskRetrieval):
1 change: 1 addition & 0 deletions mteb/abstasks/task_metadata.py
@@ -195,6 +195,7 @@ class MetadataDatasetDict(TypedDict, total=False):
name: str
split: str
trust_remote_code: bool
dataset_version: str # NLPJournalAbsArticleRetrieval.V2


class TaskMetadata(BaseModel):
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 1607635,
"num_documents": 637,
"min_document_length": 304,
"average_document_length": 2148.0376766091053,
"max_document_length": 9565,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 18,
"average_query_length": 469.2843137254902,
"max_query_length": 1290,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 308305,
"num_documents": 637,
"min_document_length": 18,
"average_document_length": 461.51962323390893,
"max_document_length": 1290,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 5,
"average_query_length": 28.072549019607845,
"max_query_length": 71,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 1382617,
"num_documents": 637,
"min_document_length": 304,
"average_document_length": 2148.0376766091053,
"max_document_length": 9565,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 5,
"average_query_length": 28.072549019607845,
"max_query_length": 71,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}