Merged
Commits (48)
9e2e972
bump ruff (#2784)
Samoed Jun 8, 2025
af7adbf
Update issue and pr templates (#2782)
Samoed Jun 9, 2025
8817670
model: Add GeoGPT-Research-Project/GeoEmbedding (#2773)
Hypothesis-Z Jun 9, 2025
1c08974
model: add fangxq/XYZ-embedding (#2741)
fangxiaoquan Jun 9, 2025
3d8dd9e
ci: fix config error for semantic release (#2800)
KennethEnevoldsen Jun 9, 2025
b8e64e1
dataset: Add R2MED Benchmark (#2795)
KennethEnevoldsen Jun 9, 2025
5e6aa9d
Update tasks & benchmarks tables
github-actions[bot] Jun 9, 2025
36a3c67
Update training datasets of GeoGPT-Research-Project/GeoEmbedding (#2802)
Hypothesis-Z Jun 10, 2025
fef1837
fix: Add adapted_from to Cmedqaretrieval (#2806)
KennethEnevoldsen Jun 10, 2025
e6238f2
1.38.28
invalid-email-address Jun 10, 2025
873ee76
fix: Adding client arg to init method of OpenAI models wrapper (#2803)
malteos Jun 11, 2025
3e291f3
model: Add annamodels/LGAI-Embedding-Preview (#2810)
annamodels Jun 11, 2025
56dc620
fix: Ensure bright uses the correct revision (#2812)
KennethEnevoldsen Jun 11, 2025
9fc0c3d
1.38.29
invalid-email-address Jun 11, 2025
04c9511
add description to issue template (#2817)
Samoed Jun 15, 2025
03e084b
model: Added 3 HIT-TMG's KaLM-embedding models (#2478)
ayush1298 Jun 15, 2025
c790269
fix: Reuploaded previously unavailable SNL datasets (#2819)
KennethEnevoldsen Jun 16, 2025
74d17b2
Update tasks & benchmarks tables
github-actions[bot] Jun 16, 2025
dcdc16a
1.38.30
invalid-email-address Jun 16, 2025
774a942
docs: Fix some typos in `docs/usage/usage.md` (#2835)
sadra-barikbin Jun 19, 2025
d7ff1ab
model: Add custom instructions for GigaEmbeddings (#2836)
ekolodin Jun 20, 2025
8851bf0
model: add Seed-1.6-embedding model (#2841)
QuanYuhan Jun 25, 2025
9a800d3
fix: Update model selection for the leaderboard (#2855)
KennethEnevoldsen Jun 25, 2025
642898f
1.38.31
invalid-email-address Jun 25, 2025
a8214e2
fix: update training dataset info of Seed-1.6-embedding model (#2857)
QuanYuhan Jun 25, 2025
82844eb
1.38.32
invalid-email-address Jun 25, 2025
f1d560a
add jinav4 model meta (#2858)
makram93 Jun 27, 2025
430357c
fix: prompt validation for tasks with `-` (#2846)
Samoed Jun 27, 2025
9fed3e5
1.38.33
invalid-email-address Jun 27, 2025
e3286d5
model: Adding Sailesh97/Hinvec (#2842)
SaileshP97 Jun 28, 2025
a4388c2
Bump gradio to fix leaderboard sorting (#2866)
Samoed Jun 28, 2025
4ff1413
model: Adding nvidia/llama-nemoretriever-colembed models (#2861)
bschifferer Jun 29, 2025
f27648b
rename seed-1.6-embedding to seed1.6-embedding (#2870)
QuanYuhan Jul 1, 2025
f346a37
fix tests to be compatible with `SentenceTransformers` `v5` (#2875)
Samoed Jul 2, 2025
5846f56
model: add listconranker modelmeta (#2874)
tutuDoki Jul 3, 2025
b67bd04
model: add kalm_models ModelMeta (new PR) (#2853)
YanshekWoo Jul 3, 2025
a3ca95c
Comment kalm model (#2877)
Samoed Jul 4, 2025
70768b5
Add and fix some Japanese datasets: ANLP datasets, JaCWIR, JQaRA (#2872)
lsz05 Jul 4, 2025
5be02c1
Update tasks & benchmarks tables
github-actions[bot] Jul 4, 2025
04dc6d4
model: add Hakim and TookaSBERTV2 models (#2826)
mehran-sarmadi Jul 4, 2025
ee17a6e
dataset: Evalita dataset integration (#2859)
MattiaSangermano Jul 7, 2025
5303fec
Update tasks & benchmarks tables
github-actions[bot] Jul 7, 2025
00c95cf
fix: pin datasets version (#2892)
Samoed Jul 10, 2025
cfa27d7
1.38.34
invalid-email-address Jul 10, 2025
c8c0d32
Merge branch 'main' into merge_main_v2_07_10
Samoed Jul 10, 2025
0b6fcae
fix model implementations
Samoed Jul 10, 2025
141fca0
fix tasks
Samoed Jul 10, 2025
8285279
add metrics
Samoed Jul 10, 2025
76 changes: 45 additions & 31 deletions docs/tasks.md

Large diffs are not rendered by default.

25 changes: 12 additions & 13 deletions docs/usage/usage.md
@@ -1,7 +1,7 @@
# Usage

This usage documentation follows a structure similar first it introduces a simple example of how to evaluate a model in MTEB.
Then introduces model detailed section of defining model, selecting tasks and running the evaluation. Each section contain subsection pertaining to
Then introduces model detailed section of defining model, selecting tasks and running the evaluation. Each section contains subsections pertaining to
these.


@@ -28,10 +28,10 @@ For instance if we want to run [`"sentence-transformers/all-MiniLM-L6-v2"`](http
```python
model_name = "sentence-transformers/all-MiniLM-L6-v2"

# or using SentenceTransformers
model = SentenceTransformers(model_name)
# load the model using MTEB
model = mteb.get_model(model_name) # will default to SentenceTransformers(model_name) if not implemented in MTEB
# or using SentenceTransformers
model = SentenceTransformers(model_name)

# select the desired tasks and evaluate
tasks = mteb.get_tasks(tasks=["Banking77Classification"])
@@ -59,7 +59,7 @@ MTEB is not only text evaluating, but also allow you to evaluate image and image
> [!NOTE]
> Running MTEB on images requires you to install the optional dependencies using `pip install mteb[image]`

To evaluate image embeddings you can follows the same approach for any other task in `mteb`. Simply ensuring that the task contains the modality "image":
To evaluate image embeddings you can follow the same approach for any other task in `mteb`. Simply ensuring that the task contains the modality "image":

```python
tasks = mteb.get_tasks(modalities=["image"]) # Only select tasks with image modalities
@@ -107,7 +107,7 @@ model = meta.load_model()
model = mteb.get_model(model_name)
```

You can get an overview of on the models available in `mteb` as follows:
You can get an overview of the models available in `mteb` as follows:

```py
model_metas = mteb.get_model_metas()
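# Hedged follow-up sketch: ModelMeta entries expose basic fields such as `name`
# (attribute assumed from mteb's model registry); print a few as an overview.
print([meta.name for meta in model_metas[:5]])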
@@ -132,7 +132,7 @@ tasks = mteb.get_tasks(tasks=["Banking77Classification"])
results = mteb.evaluate(model, tasks=tasks)
```

However, we do recommend check in mteb include an implementation of the model before using sentence transformers since some models (e.g. the [multilingual e5 models](https://huggingface.co/collections/intfloat/multilingual-e5-text-embeddings-67b2b8bb9bff40dec9fb3534)) require a prompt and not specifying it may reduce performance.
However, we do recommend checking if mteb includes an implementation of the model before using sentence transformers since some models (e.g. the [multilingual e5 models](https://huggingface.co/collections/intfloat/multilingual-e5-text-embeddings-67b2b8bb9bff40dec9fb3534)) require a prompt and not specifying it may reduce performance.

> [!NOTE]
> If you want to evaluate a cross encoder on a reranking task, see section on [running cross encoders for reranking](#running-cross-encoders-on-reranking)
@@ -141,7 +141,7 @@ However, we do recommend check in mteb include an implementation of the model be

It is also possible to implement your own custom model in MTEB as long as it adheres to the [encoder interface](https://github.com/embeddings-benchmark/mteb/blob/main/mteb/encoder_interface.py#L21).

This entails implementing an `encode` function taking as inputs a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.).
This entails implementing an `encode` function taking as input a list of sentences, and returning a list of embeddings (embeddings can be `np.array`, `torch.tensor`, etc.).

```python
import mteb
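# Hedged sketch of a custom model (illustrative names only): per the description
# above, it only needs an `encode` method mapping a list of sentences to one
# embedding per sentence; the exact keyword arguments mteb passes are assumed here.
import numpy as np


class MyDummyEncoder:
    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # Replace this random projection with a real model's forward pass.
        return np.random.default_rng(0).random((len(sentences), 384))


# tasks = mteb.get_tasks(tasks=["Banking77Classification"])
# results = mteb.evaluate(MyDummyEncoder(), tasks=tasks)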
@@ -181,7 +181,7 @@ If you want to submit your implementation to be included in the leaderboard see

## Selecting Tasks

This section describes how to select benchmarks and task to evaluate, including selecting specific subsets or splits to run.
This section describes how to select benchmarks and tasks to evaluate, including selecting specific subsets or splits to run.

### Selecting a Benchmark

@@ -197,7 +197,7 @@ results = mteb.evaluate(model, tasks=benchmark)

The benchmark specified not only a list of tasks, but also what splits and language to run on.

To get an overview of all available benchmarks simply run:
To get an overview of all available benchmarks, simply run:

```python
import mteb
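# Hedged sketch: enumerate the registered benchmarks and read their names
# (`get_benchmarks()` and the `name` attribute are assumed from mteb's public API).
benchmarks = mteb.get_benchmarks()
print([b.name for b in benchmarks][:5])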
@@ -218,7 +218,7 @@ benchmark.citation

### Task selection

`mteb` comes the utility function `mteb.get_task` and `mteb_get_tasks` for fetching and analysing the tasks of interest.
`mteb` comes with the utility function `mteb.get_task` and `mteb_get_tasks` for fetching and analysing the tasks of interest.

This can be done in multiple ways, e.g.:

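For example, a hedged sketch of a few common selection patterns (the filter argument shown for languages is an assumption based on `mteb.get_tasks`' documented filters):

```python
import mteb

# fetch a single task by name
task = mteb.get_task("Banking77Classification")

# fetch several tasks at once
tasks = mteb.get_tasks(tasks=["Banking77Classification", "NFCorpus"])

# filter by language code (argument name assumed)
eng_tasks = mteb.get_tasks(languages=["eng"])
```
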
@@ -296,7 +296,7 @@ results = mteb.evaluate(model, tasks=[MyCustomTask()])

## Running the Evaluation

This section contain documentation related to the runtime of the evalution. How to pass arguments to the encoder, saving outputs and similar.
This section contains documentation related to the runtime of the evaluation. How to pass arguments to the encoder, saving outputs and similar.


### Introduction to `mteb.evaluate()`
@@ -307,7 +307,6 @@ Evalauting models in `mteb` typically takes the simple form:
results = mteb.evaluate(model, tasks=tasks)
```


### Specifying the cache

By default `mteb` with save the results in cache folder located at `~/.cache/mteb`, however if you want to saving the results in a specific folder you
@@ -360,7 +359,7 @@ In prompts the key can be:
8. `STS`
9. `Summarization`
10. `InstructionRetrieval`
3. Pair of task type and prompt type like `Retrival-query` - these prompts will be used in all classification tasks
3. Pair of task type and prompt type like `Retrieval-query` - these prompts will be used in all Retrieval tasks
4. Task name - these prompts will be used in the specific task
5. Pair of task name and prompt type like `NFCorpus-query` - these prompts will be used in the specific task

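As an illustration, a hedged sketch of a prompts mapping using the key forms above (the prompt strings are invented, and the exact argument or wrapper that consumes such a mapping depends on the model implementation):

```python
# Keys follow the forms listed above: task type, task type + prompt type,
# and task name + prompt type. The prompt texts are placeholders.
model_prompts = {
    "Classification": "Classify the following text: ",
    "Retrieval-query": "Represent this query for retrieving relevant documents: ",
    "NFCorpus-query": "Represent this medical question for retrieval: ",
}
```
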
2 changes: 1 addition & 1 deletion mteb/abstasks/AbsTaskReranking.py
@@ -10,7 +10,7 @@

logger = logging.getLogger(__name__)

OLD_FORMAT_RERANKING_TASKS = []
OLD_FORMAT_RERANKING_TASKS = ["JQaRAReranking", "JaCWIRReranking", "XGlueWPRReranking"]


class AbsTaskReranking(AbsTaskRetrieval):
1 change: 1 addition & 0 deletions mteb/abstasks/task_metadata.py
@@ -195,6 +195,7 @@ class MetadataDatasetDict(TypedDict, total=False):
name: str
split: str
trust_remote_code: bool
dataset_version: str # NLPJournalAbsArticleRetrieval.V2


class TaskMetadata(BaseModel):
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 1607635,
"num_documents": 637,
"min_document_length": 304,
"average_document_length": 2148.0376766091053,
"max_document_length": 9565,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 18,
"average_query_length": 469.2843137254902,
"max_query_length": 1290,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 308305,
"num_documents": 637,
"min_document_length": 18,
"average_document_length": 461.51962323390893,
"max_document_length": 1290,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 5,
"average_query_length": 28.072549019607845,
"max_query_length": 71,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}
@@ -0,0 +1,31 @@
{
"test": {
"num_samples": 1147,
"number_of_characters": 1382617,
"num_documents": 637,
"min_document_length": 304,
"average_document_length": 2148.0376766091053,
"max_document_length": 9565,
"unique_documents": 637,
"num_queries": 510,
"min_query_length": 5,
"average_query_length": 28.072549019607845,
"max_query_length": 71,
"unique_queries": 510,
"none_queries": 0,
"num_relevant_docs": 510,
"min_relevant_docs_per_query": 1,
"average_relevant_docs_per_query": 1.0,
"max_relevant_docs_per_query": 1,
"unique_relevant_docs": 510,
"num_instructions": null,
"min_instruction_length": null,
"average_instruction_length": null,
"max_instruction_length": null,
"unique_instructions": null,
"num_top_ranked": null,
"min_top_ranked_per_query": null,
"average_top_ranked_per_query": null,
"max_top_ranked_per_query": null
}
}