pip install mteb
- Using a Python script (see scripts/run_mteb_english.py and mteb/mtebscripts for more):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
- Using the CLI:
mteb --available_tasks
mteb -m average_word_embeddings_komninos \
-t Banking77Classification \
--output_folder results/average_word_embeddings_komninos \
--verbosity 3
- Multiple GPUs can be used in parallel by providing a custom encode function that distributes the inputs across the GPUs (see the example here). For retrieval tasks you can also use the following (see scripts/retrieval.slurm for a multi-node SLURM script example):
pip install git+https://github.com/NouamaneTazi/beir@nouamane/better-multi-gpu
# Run on 2 gpus
torchrun --nproc_per_node=2 scripts/retrieval_multigpu.py
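A custom encode function that distributes inputs, as mentioned above, could be sketched roughly as follows. This is only an illustration, not the implementation used by MTEB or BEIR: `ShardedEncoder` and the `workers` callables are hypothetical names, and each worker is assumed to wrap a copy of the model placed on one GPU and to return one embedding per input sentence.

```python
from concurrent.futures import ThreadPoolExecutor


class ShardedEncoder:
    """Hypothetical wrapper that splits the input across several workers."""

    def __init__(self, workers):
        # Assumption: each worker is a callable `worker(sentences, **kwargs)`
        # wrapping a model copy that lives on its own device.
        if not workers:
            raise ValueError("at least one worker is required")
        self.workers = list(workers)

    def encode(self, sentences, batch_size=32, **kwargs):
        sentences = list(sentences)
        n = len(self.workers)
        # Contiguous shards of near-equal size, so concatenating the outputs
        # in worker order restores the original sentence order.
        size = -(-len(sentences) // n)  # ceiling division
        shards = [sentences[i * size:(i + 1) * size] for i in range(n)]
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [
                pool.submit(worker, shard, batch_size=batch_size, **kwargs)
                for worker, shard in zip(self.workers, shards)
                if shard
            ]
            embeddings = []
            for future in futures:
                embeddings.extend(future.result())
        return embeddings
```

An instance can then be passed to `evaluation.run(...)` like any other model exposing `encode`. Threads are just one option here; separate processes may be a better fit depending on the framework.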
Datasets can be selected by providing the list of datasets, but also:
- by their task (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
- by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets
- by their languages
evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"
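To make the language filter concrete, the matching rule described in the comment above ("en", "de" or "en-de") can be pictured as below. This is a sketch of one plausible reading, not MTEB's actual code: `language_selected` is a hypothetical helper, and it assumes a crosslingual pair is written as two codes joined by a hyphen.

```python
def language_selected(eval_lang, task_langs):
    """True if a language code or pair such as "en-de" is covered by task_langs."""
    wanted = set(task_langs)
    # Assumption: a hyphen separates the two sides of a crosslingual pair,
    # so codes that themselves contain a hyphen are not handled here.
    return all(part in wanted for part in eval_lang.split("-"))
```

Under this reading, `task_langs=["en", "de"]` keeps "en", "de" and "en-de" subsets but drops "en-fr".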
You can also specify which languages to load for multilingual/crosslingual tasks like below:
from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining
evaluation = MTEB(tasks=[
    AmazonReviewsClassification(langs=["en", "fr"]),  # Only load "en" and "fr" subsets of Amazon Reviews
    BUCCBitextMining(langs=["de-en"]),  # Only load "de-en" subset of BUCC
])
You can evaluate only on test splits of all tasks by doing the following:
evaluation.run(model, eval_splits=["test"])
Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
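If you want to mirror the leaderboard's choice of splits per task, a small helper along these lines may be convenient. `leaderboard_splits` is a hypothetical name, and the rule it encodes is only the one stated in the note above.

```python
def leaderboard_splits(task_name):
    """Return the eval splits the public leaderboard uses for a task name."""
    # Per the note above: "dev" for MSMARCO, "test" for everything else.
    return ["dev"] if task_name == "MSMARCO" else ["test"]
```

It could be used as `evaluation.run(model, eval_splits=leaderboard_splits("MSMARCO"))` when running one task at a time.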
Models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, you can look at the mteb/mtebscripts repo used for running diverse models via SLURM scripts for the paper.
class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
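As a concrete, if deliberately naive, instance of this interface, the sketch below fills in `encode` with a hashed bag-of-words embedding. `HashingBagOfWordsModel` and its `dim` parameter are illustrative assumptions, not part of MTEB; the point is only the shape of the input and output.

```python
import zlib

import numpy as np


class HashingBagOfWordsModel:
    """Toy model: hashed bag-of-words vectors, L2-normalised."""

    def __init__(self, dim=256):
        self.dim = dim

    def encode(self, sentences, batch_size=32, **kwargs):
        # batch_size is accepted for interface compatibility but unused here.
        embeddings = np.zeros((len(sentences), self.dim), dtype=np.float32)
        for row, sentence in enumerate(sentences):
            for token in sentence.lower().split():
                # crc32 gives a hash that is stable across Python processes.
                column = zlib.crc32(token.encode("utf-8")) % self.dim
                embeddings[row, column] += 1.0
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        # Avoid division by zero for empty sentences.
        return embeddings / np.maximum(norms, 1e-12)
```

The method returns one row per input sentence, in input order. Such a model can be passed to `evaluation.run(...)` in place of `MyModel()`.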
If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for encode_queries and encode_corpus. If these methods exist, they will be automatically used for those tasks. You can refer to the DRESModel at mteb/mteb/abstasks/AbsTaskRetrieval.py for an example of these functions.
class MyModel():
    def encode_queries(self, queries, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            queries (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            corpus (`List[str]` or `List[Dict[str, str]]`): List of sentences to encode
                or list of dictionaries with keys "title" and "text"
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass
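One way to fill in these two methods is to wrap an existing model and treat queries and documents differently, for instance by prepending different prefixes. The sketch below is an assumption-laden illustration rather than the DRESModel implementation: `QueryCorpusModel`, `query_prefix` and `corpus_prefix` are hypothetical names, and the base model is assumed to expose the `encode` interface shown earlier.

```python
class QueryCorpusModel:
    """Hypothetical wrapper adding encode_queries / encode_corpus to a base model."""

    def __init__(self, base_model, query_prefix="", corpus_prefix=""):
        # Assumption: base_model exposes encode(sentences, batch_size=..., **kwargs).
        self.base_model = base_model
        self.query_prefix = query_prefix
        self.corpus_prefix = corpus_prefix

    def encode_queries(self, queries, batch_size=32, **kwargs):
        texts = [self.query_prefix + query for query in queries]
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        texts = []
        for doc in corpus:
            if isinstance(doc, dict):
                # Join title and text; either key may be missing or empty.
                parts = (doc.get("title", ""), doc.get("text", ""))
                doc = " ".join(part for part in parts if part)
            texts.append(self.corpus_prefix + doc)
        return self.base_model.encode(texts, batch_size=batch_size, **kwargs)
```

Both plain strings and "title"/"text" dictionaries are accepted by `encode_corpus`, matching the two corpus formats named in the docstring above.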
To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer
class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }
model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class like in this example.
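When writing the description property for a new task, it can help to check the dictionary against the keys used in the MindSmallReranking example above. The helper below is a hypothetical convenience, not an MTEB function, and it assumes only that the example's keys are the ones worth checking.

```python
# Keys taken from the MindSmallReranking example above (an assumption about
# what a complete description contains, not an official schema).
EXAMPLE_DESCRIPTION_KEYS = (
    "name", "hf_hub_name", "description", "reference", "type",
    "category", "eval_splits", "eval_langs", "main_score",
)


def missing_description_keys(description):
    """Return the example's keys that are absent from a description dict."""
    return [key for key in EXAMPLE_DESCRIPTION_KEYS if key not in description]
```

For example, `missing_description_keys(MyTask().description)` returns an empty list when every key from the example is present.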
The MTEB Leaderboard is available here. To submit:
- Run on MTEB: You can reference scripts/run_mteb_english.py for all MTEB English datasets used in the main ranking, or scripts/run_mteb_chinese.py for the Chinese ones. Advanced scripts with different models are available in the mteb/mtebscripts repo.
- Format the json files into metadata using the script at scripts/mteb_meta.py. For example, running python scripts/mteb_meta.py path_to_results_folder creates an mteb_metadata.md file. If you ran CQADupstack retrieval, make sure to merge the results first with python scripts/merge_cqadupstack.py path_to_results_folder.
- Copy the content of the mteb_metadata.md file to the top of a README.md file of your model on the Hub. See here for an example.
- Hit the Refresh button at the bottom of the leaderboard and you should see your scores 🥇
- To have the scores appear without refreshing, you can open an issue on the Community Tab of the LB and someone will restart the space to cache your average scores. The cache is updated about once a week anyway.
Name | Hub URL | Description | Type | Category | #Languages | Train #Samples | Dev #Samples | Test #Samples | Avg. chars / train | Avg. chars / dev | Avg. chars / test |
---|---|---|---|---|---|---|---|---|---|---|---|
BUCC | mteb/bucc-bitext-mining | BUCC bitext mining dataset | BitextMining | s2s | 4 | 0 | 0 | 641684 | 0 | 0 | 101.3 |
Tatoeba | mteb/tatoeba-bitext-mining | 1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus | BitextMining | s2s | 112 | 0 | 0 | 2000 | 0 | 0 | 39.4 |
Bornholm parallel | strombergnlp/bornholmsk_parallel | Danish Bornholmsk Parallel Corpus. | BitextMining | s2s | 2 | 100 | 100 | 100 | 64.6 | 86.2 | 89.7 |
AmazonCounterfactualClassification | mteb/amazon_counterfactual | A collection of Amazon customer reviews annotated for counterfactual detection pair classification. | Classification | s2s | 4 | 4018 | 335 | 670 | 107.3 | 109.2 | 106.1 |
AmazonPolarityClassification | mteb/amazon_polarity | Amazon Polarity Classification Dataset. | Classification | s2s | 1 | 3600000 | 0 | 400000 | 431.6 | 0 | 431.4 |
AmazonReviewsClassification | mteb/amazon_reviews_multi | A collection of Amazon reviews specifically designed to aid research in multilingual text classification. | Classification | s2s | 6 | 1200000 | 30000 | 30000 | 160.5 | 159.2 | 160.4 |
Banking77Classification | mteb/banking77 | Dataset composed of online banking queries annotated with their corresponding intents. | Classification | s2s | 1 | 10003 | 0 | 3080 | 59.5 | 0 | 54.2 |
EmotionClassification | mteb/emotion | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. | Classification | s2s | 1 | 16000 | 2000 | 2000 | 96.8 | 95.3 | 96.6 |
ImdbClassification | mteb/imdb | Large Movie Review Dataset | Classification | p2p | 1 | 25000 | 0 | 25000 | 1325.1 | 0 | 1293.8 |
MassiveIntentClassification | mteb/amazon_massive_intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 | 11514 | 2033 | 2974 | 35.0 | 34.8 | 34.6 |
MassiveScenarioClassification | mteb/amazon_massive_scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51 | 11514 | 2033 | 2974 | 35.0 | 34.8 | 34.6 |
MTOPDomainClassification | mteb/mtop_domain | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 | 15667 | 2235 | 4386 | 36.6 | 36.5 | 36.8 |
MTOPIntentClassification | mteb/mtop_intent | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6 | 15667 | 2235 | 4386 | 36.6 | 36.5 | 36.8 |
ToxicConversationsClassification | mteb/toxic_conversations_50k | Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. | Classification | s2s | 1 | 50000 | 0 | 50000 | 298.8 | 0 | 296.6 |
TweetSentimentExtractionClassification | mteb/tweet_sentiment_extraction | | Classification | s2s | 1 | 27481 | 0 | 3534 | 68.3 | 0 | 67.8 |
AngryTweetsClassification | mteb/DDSC/angry-tweets | A sentiment dataset with 3 classes (positiv, negativ, neutral) for Danish tweets | Classification | s2s | 1 | 2410 | 0 | 1050 | 153.0 | 0 | 156.1 |
DKHateClassification | DDSC/dkhate | Danish Tweets annotated for Hate Speech | Classification | s2s | 1 | 2960 | 0 | 329 | 88.2 | 0 | 104.0 |
DalajClassification | AI-Sweden/SuperLim | A Swedish dataset for linguistic acceptability. Available as a part of Superlim | Classification | s2s | 1 | 3840 | 445 | 444 | 243.7 | 242.5 | 243.8 |
DanishPoliticalCommentsClassification | danish_political_comments | A dataset of Danish political comments rated for sentiment | Classification | s2s | 1 | 9010 | 0 | 0 | 69.9 | 0 | 0 |
LccClassification | DDSC/lcc | The leipzig corpora collection, annotated for sentiment | Classification | s2s | 1 | 349 | 0 | 150 | 113.5 | 0 | 118.7 |
NoRecClassification | ScandEval/norec-mini | A Norwegian dataset for sentiment classification on review | Classification | s2s | 1 | 1020 | 256 | 2050 | 86.9 | 89.6 | 82.0 |
NordicLangClassification | strombergnlp/nordic_langid | A dataset for Nordic language identification. | Classification | s2s | 6 | 57000 | 0 | 3000 | 78.4 | 0 | 78.2 |
NorwegianParliamentClassification | NbAiLab/norwegian_parliament | Norwegian parliament speeches annotated for sentiment | Classification | s2s | 1 | 3600 | 1200 | 1200 | 1773.6 | 1911.0 | 1884.0 |
ScalaDaClassification | ScandEval/scala-da | A modified version of DDT modified for linguistic acceptability classification | Classification | s2s | 1 | 1024 | 256 | 2048 | 107.6 | 100.8 | 109.4 |
ScalaNbClassification | ScandEval/scala-nb | A Norwegian dataset for linguistic acceptability classification for Bokmål | Classification | s2s | 1 | 1024 | 256 | 2048 | 95.5 | 94.8 | 98.4 |
ScalaNnClassification | ScandEval/scala-nn | A Norwegian dataset for linguistic acceptability classification for Nynorsk | Classification | s2s | 1 | 1024 | 256 | 2048 | 105.3 | 103.5 | 104.8 |
ScalaSvClassification | ScandEval/scala-sv | A Swedish dataset for linguistic acceptability classification | Classification | s2s | 1 | 1024 | 256 | 2048 | 102.6 | 113.0 | 98.3 |
SweRecClassification | ScandEval/swerec-mini | A Swedish dataset for sentiment classification on reviews | Classification | s2s | 1 | 1024 | 256 | 2048 | 317.7 | 293.4 | 318.8 |
CBD | PL-MTEB/cbd | Polish Tweets annotated for cyberbullying detection. | Classification | s2s | 1 | 10041 | 0 | 1000 | 93.6 | 0 | 93.2 |
PolEmo2.0-IN | PL-MTEB/polemo2_in | A collection of Polish online reviews from four domains: medicine, hotels, products and school. The PolEmo2.0-IN task is to predict the sentiment of in-domain (medicine and hotels) reviews. | Classification | s2s | 1 | 5783 | 723 | 722 | 780.6 | 769.4 | 756.2 |
PolEmo2.0-OUT | PL-MTEB/polemo2_out | A collection of Polish online reviews from four domains: medicine, hotels, products and school. The PolEmo2.0-OUT task is to predict the sentiment of out-of-domain (products and school) reviews using models trained on reviews from the medicine and hotels domains. | Classification | s2s | 1 | 5783 | 494 | 494 | 780.6 | 589.3 | 587.0 |
AllegroReviews | PL-MTEB/allegro-reviews | A Polish dataset for sentiment classification on reviews from e-commerce marketplace Allegro. | Classification | s2s | 1 | 9577 | 1002 | 1006 | 477.9 | 480.9 | 477.2 |
PAC | laugustyniak/abusive-clauses-pl | Polish Abusive Clauses Dataset | Classification | s2s | 1 | 4284 | 1519 | 3453 | 185.3 | 256.8 | 185.3 |
ArxivClusteringP2P | mteb/arxiv-clustering-p2p | Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | p2p | 1 | 0 | 0 | 732723 | 0 | 0 | 1009.9 |
ArxivClusteringS2S | mteb/arxiv-clustering-s2s | Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | s2s | 1 | 0 | 0 | 732723 | 0 | 0 | 74.0 |
BiorxivClusteringP2P | mteb/biorxiv-clustering-p2p | Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 | 0 | 0 | 75000 | 0 | 0 | 1666.2 |
BiorxivClusteringS2S | mteb/biorxiv-clustering-s2s | Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 | 0 | 0 | 75000 | 0 | 0 | 101.6 |
BlurbsClusteringP2P | slvnwhrl/blurbs-clustering-p2p | Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre | Clustering | p2p | 1 | 0 | 0 | 174637 | 0 | 0 | 664.09 |
BlurbsClusteringS2S | slvnwhrl/blurbs-clustering-s2s | Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre. | Clustering | s2s | 1 | 0 | 0 | 174637 | 0 | 0 | 23.02 |
MedrxivClusteringP2P | mteb/medrxiv-clustering-p2p | Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1 | 0 | 0 | 37500 | 0 | 0 | 1981.2 |
MedrxivClusteringS2S | mteb/medrxiv-clustering-s2s | Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1 | 0 | 0 | 37500 | 0 | 0 | 114.7 |
RedditClustering | mteb/reddit-clustering | Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 | 0 | 0 | 420464 | 0 | 0 | 64.7 |
RedditClusteringP2P | mteb/reddit-clustering-p2p | Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. | Clustering | p2p | 1 | 0 | 0 | 459399 | 0 | 0 | 727.7 |
StackExchangeClustering | mteb/stackexchange-clustering | Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. | Clustering | s2s | 1 | 0 | 417060 | 373850 | 0 | 56.8 | 57.0 |
StackExchangeClusteringP2P | mteb/stackexchange-clustering-p2p | Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. | Clustering | p2p | 1 | 0 | 0 | 75000 | 0 | 0 | 1090.7 |
TenKGnadClusteringP2P | slvnwhrl/tenkgnad-clustering-p2p | Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category. | Clustering | p2p | 1 | 0 | 0 | 45914 | 0 | 0 | 2641.03 |
TenKGnadClusteringS2S | slvnwhrl/tenkgnad-clustering-s2s | Clustering of news article titles. Clustering of 10 splits on the news article category. | Clustering | s2s | 1 | 0 | 0 | 45914 | 0 | 0 | 50.96 |
TwentyNewsgroupsClustering | mteb/twentynewsgroups-clustering | Clustering of the 20 Newsgroups dataset (subject only). | Clustering | s2s | 1 | 0 | 0 | 59545 | 0 | 0 | 32.0 |
8TagsClustering | PL-MTEB/8tags-clustering | Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology. | Clustering | s2s | 1 | 40001 | 5000 | 4372 | 78.2 | 77.6 | 79.2 |
SprintDuplicateQuestions | mteb/sprintduplicatequestions-pairclassification | Duplicate questions from the Sprint community. | PairClassification | s2s | 1 | 0 | 101000 | 101000 | 0 | 65.2 | 67.9 |
TwitterSemEval2015 | mteb/twittersemeval2015-pairclassification | Paraphrase-Pairs of Tweets from the SemEval 2015 workshop. | PairClassification | s2s | 1 | 0 | 0 | 16777 | 0 | 0 | 38.3 |
TwitterURLCorpus | mteb/twitterurlcorpus-pairclassification | Paraphrase-Pairs of Tweets. | PairClassification | s2s | 1 | 0 | 0 | 51534 | 0 | 0 | 79.5 |
PPC | PL-MTEB/ppc-pairclassification | Polish Paraphrase Corpus | PairClassification | s2s | 1 | 5000 | 1000 | 1000 | 41.0 | 41.0 | 40.2 |
PSC | PL-MTEB/psc-pairclassification | Polish Summaries Corpus | PairClassification | s2s | 1 | 4302 | 0 | 1078 | 537.1 | 0 | 549.3 |
SICK-E-PL | PL-MTEB/sicke-pl-pairclassification | Polish version of SICK dataset for textual entailment. | PairClassification | s2s | 1 | 4439 | 495 | 4906 | 43.4 | 44.7 | 43.2 |
CDSC-E | PL-MTEB/cdsce-pairclassification | Compositional Distributional Semantics Corpus for textual entailment. | PairClassification | s2s | 1 | 8000 | 1000 | 1000 | 71.9 | 73.5 | 75.2 |
AskUbuntuDupQuestions | mteb/askubuntudupquestions-reranking | AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar | Reranking | s2s | 1 | 0 | 0 | 2255 | 0 | 0 | 52.5 |
MindSmallReranking | mteb/mind_small | Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research | Reranking | s2s | 1 | 231530 | 0 | 107968 | 69.0 | 0 | 70.9 |
SciDocsRR | mteb/scidocs-reranking | Ranking of related scientific papers based on their title. | Reranking | s2s | 1 | 0 | 19594 | 19599 | 0 | 69.4 | 69.0 |
StackOverflowDupQuestions | mteb/stackoverflowdupquestions-reranking | Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python | Reranking | s2s | 1 | 23018 | 0 | 3467 | 49.6 | 0 | 49.8 |
ArguAna | BeIR/arguana | ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge | Retrieval | p2p | 1 | 0 | 0 | 10080 | 0 | 0 | 1052.9 |
ClimateFEVER | BeIR/climate-fever | CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change. | Retrieval | s2p | 1 | 0 | 0 | 5418128 | 0 | 0 | 539.1 |
CQADupstackAndroidRetrieval | BeIR/cqadupstack/android | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 23697 | 0 | 0 | 578.7 |
CQADupstackEnglishRetrieval | BeIR/cqadupstack/english | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 41791 | 0 | 0 | 467.1 |
CQADupstackGamingRetrieval | BeIR/cqadupstack/gaming | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 46896 | 0 | 0 | 474.7 |
CQADupstackGisRetrieval | BeIR/cqadupstack/gis | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 38522 | 0 | 0 | 991.1 |
CQADupstackMathematicaRetrieval | BeIR/cqadupstack/mathematica | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 17509 | 0 | 0 | 1103.7 |
CQADupstackPhysicsRetrieval | BeIR/cqadupstack/physics | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 39355 | 0 | 0 | 799.4 |
CQADupstackProgrammersRetrieval | BeIR/cqadupstack/programmers | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 33052 | 0 | 0 | 1030.2 |
CQADupstackStatsRetrieval | BeIR/cqadupstack/stats | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 42921 | 0 | 0 | 1041.0 |
CQADupstackTexRetrieval | BeIR/cqadupstack/tex | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 71090 | 0 | 0 | 1246.9 |
CQADupstackUnixRetrieval | BeIR/cqadupstack/unix | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 48454 | 0 | 0 | 984.7 |
CQADupstackWebmastersRetrieval | BeIR/cqadupstack/webmasters | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 17911 | 0 | 0 | 689.8 |
CQADupstackWordpressRetrieval | BeIR/cqadupstack/wordpress | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2p | 1 | 0 | 0 | 49146 | 0 | 0 | 1111.9 |
DBPedia | BeIR/dbpedia-entity | DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base | Retrieval | s2p | 1 | 0 | 4635989 | 4636322 | 0 | 310.2 | 310.1 |
FEVER | BeIR/fever | FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. | Retrieval | s2p | 1 | 0 | 0 | 5423234 | 0 | 0 | 538.6 |
FiQA2018 | BeIR/fiqa | Financial Opinion Mining and Question Answering | Retrieval | s2p | 1 | 0 | 0 | 58286 | 0 | 0 | 760.4 |
HotpotQA | BeIR/hotpotqa | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. | Retrieval | s2p | 1 | 0 | 0 | 5240734 | 0 | 0 | 288.6 |
MSMARCO | BeIR/msmarco | MS MARCO is a collection of datasets focused on deep learning in search. Note that the dev set is used for the leaderboard. | Retrieval | s2p | 1 | 0 | 8848803 | 8841866 | 0 | 336.6 | 336.8 |
MSMARCOv2 | BeIR/msmarco-v2 | MS MARCO is a collection of datasets focused on deep learning in search | Retrieval | s2p | 1 | 138641342 | 138368101 | 0 | 341.4 | 342.0 | 0 |
NFCorpus | BeIR/nfcorpus | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2p | 1 | 0 | 0 | 3956 | 0 | 0 | 1462.7 |
NQ | BeIR/nq | Natural Questions: A Benchmark for Question Answering Research | Retrieval | s2p | 1 | 0 | 0 | 2684920 | 0 | 0 | 492.7 |
QuoraRetrieval | BeIR/quora | QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. | Retrieval | s2s | 1 | 0 | 0 | 532931 | 0 | 0 | 62.9 |
SCIDOCS | BeIR/scidocs | SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. | Retrieval | s2p | 1 | 0 | 0 | 26657 | 0 | 0 | 1161.9 |
SciFact | BeIR/scifact | SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts. | Retrieval | s2p | 1 | 0 | 0 | 5483 | 0 | 0 | 1422.3 |
Touche2020 | BeIR/webis-touche2020 | Touché Task 1: Argument Retrieval for Controversial Questions | Retrieval | s2p | 1 | 0 | 0 | 382594 | 0 | 0 | 1720.1 |
TRECCOVID | BeIR/trec-covid | TRECCOVID is an ad-hoc search challenge based on the CORD-19 dataset containing scientific articles related to the COVID-19 pandemic | Retrieval | s2p | 1 | 0 | 0 | 171382 | 0 | 0 | 1117.4 |
ArguAna-PL | BeIR-PL/arguana-pl | ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge | Retrieval | p2p | 1 | 0 | 0 | 10080 | 0 | 0 | 1052.9 |
DBPedia-PL | BeIR-PL/dbpedia-pl | DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base | Retrieval | s2p | 1 | 0 | 4635989 | 4636322 | 0 | 310.2 | 310.1 |
FiQA-PL | BeIR-PL/fiqa-pl | Financial Opinion Mining and Question Answering | Retrieval | s2p | 1 | 0 | 0 | 58286 | 0 | 0 | 760.4 |
HotpotQA-PL | BeIR-PL/hotpotqa-pl | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. | Retrieval | s2p | 1 | 0 | 0 | 5240734 | 0 | 0 | 288.6 |
MSMARCO-PL | BeIR-PL/msmarco-pl | MS MARCO is a collection of datasets focused on deep learning in search. Note that the dev set is used for the leaderboard. | Retrieval | s2p | 1 | 0 | 8848803 | 8841866 | 0 | 336.6 | 336.8 |
NFCorpus-PL | BeIR-PL/nfcorpus-pl | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2p | 1 | 0 | 0 | 3956 | 0 | 0 | 1462.7 |
NQ-PL | BeIR-PL/nq-pl | Natural Questions: A Benchmark for Question Answering Research | Retrieval | s2p | 1 | 0 | 0 | 2684920 | 0 | 0 | 492.7 |
Quora-PL | BeIR-PL/quora-pl | QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. | Retrieval | s2s | 1 | 0 | 0 | 532931 | 0 | 0 | 62.9 |
SCIDOCS-PL | BeIR-PL/scidocs-pl | SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. | Retrieval | s2p | 1 | 0 | 0 | 26657 | 0 | 0 | 1161.9 |
SciFact-PL | BeIR-PL/scifact-pl | SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts. | Retrieval | s2p | 1 | 0 | 0 | 5483 | 0 | 0 | 1422.3 |
SweFAQ | AI-Sweden/SuperLim | Frequently asked questions from Swedish authorities' websites | Retrieval | s2p | 1 | 0 | 0 | 513 | 0 | 0 | 390.57 |
BIOSSES | mteb/biosses-sts | Biomedical Semantic Similarity Estimation. | STS | s2s | 1 | 0 | 0 | 200 | 0 | 0 | 156.6 |
SICK-R | mteb/sickr-sts | Semantic Textual Similarity SICK-R dataset. | STS | s2s | 1 | 0 | 0 | 19854 | 0 | 0 | 46.1 |
STS12 | mteb/sts12-sts | SemEval STS 2012 dataset. | STS | s2s | 1 | 4468 | 0 | 6216 | 100.7 | 0 | 64.7 |
STS13 | mteb/sts13-sts | SemEval STS 2013 dataset. | STS | s2s | 1 | 0 | 0 | 3000 | 0 | 0 | 54.0 |
STS14 | mteb/sts14-sts | SemEval STS 2014 dataset. Currently only the English dataset | STS | s2s | 1 | 0 | 0 | 7500 | 0 | 0 | 54.3 |
STS15 | mteb/sts15-sts | SemEval STS 2015 dataset | STS | s2s | 1 | 0 | 0 | 6000 | 0 | 0 | 57.7 |
STS16 | mteb/sts16-sts | SemEval STS 2016 dataset | STS | s2s | 1 | 0 | 0 | 2372 | 0 | 0 | 65.3 |
STS17 | mteb/sts17-crosslingual-sts | STS 2017 dataset | STS | s2s | 11 | 0 | 0 | 500 | 0 | 0 | 43.3 |
STS22 | mteb/sts22-crosslingual-sts | SemEval 2022 Task 8: Multilingual News Article Similarity | STS | s2s | 18 | 0 | 0 | 8060 | 0 | 0 | 1992.8 |
STSBenchmark | mteb/stsbenchmark-sts | Semantic Textual Similarity Benchmark (STSbenchmark) dataset. | STS | s2s | 1 | 11498 | 3000 | 2758 | 57.6 | 64.0 | 53.6 |
SICK-R-PL | PL-MTEB/sickr-pl-sts | Polish version of SICK dataset for textual relatedness. | STS | s2s | 1 | 8878 | 990 | 9812 | 42.9 | 44.0 | 42.8 |
CDSC-R | PL-MTEB/cdscr-sts | Compositional Distributional Semantics Corpus for textual relatedness. | STS | s2s | 1 | 16000 | 2000 | 2000 | 72.1 | 73.2 | 75.0 |
SummEval | mteb/summeval | News Article Summary Semantic Similarity Estimation. | Summarization | s2s | 1 | 0 | 0 | 2800 | 0 | 0 | 359.8 |
For Chinese tasks, you can refer to C_MTEB.
If you find MTEB useful, feel free to cite our publication MTEB: Massive Text Embedding Benchmark:
@article{muennighoff2022mteb,
doi = {10.48550/ARXIV.2210.07316},
url = {https://arxiv.org/abs/2210.07316},
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
publisher = {arXiv},
journal={arXiv preprint arXiv:2210.07316},
year = {2022}
}