Merged

52 commits
90e9f43
fix: Correct metadata for ArguAna dataset (#3202)
whybe-choi Sep 21, 2025
920dafe
Update tasks & benchmarks tables
github-actions[bot] Sep 21, 2025
cd37c7a
1.38.57
Sep 21, 2025
6718290
model: Add BMRetriever (#3195)
whybe-choi Sep 22, 2025
6e72dc0
Revert "Ci: test out GH models with welcoming new comers" (#3206)
isaac-chung Sep 22, 2025
4f6d791
model: Add Codefuse models (#3205)
Geralt-Targaryen Sep 24, 2025
82d9e29
fix(models): ensure prompt_type is passed to format_instruction (#3216)
whybe-choi Sep 26, 2025
d0d427d
1.38.58
Sep 27, 2025
08bba49
Adding Cohere's output_dimension and embedding_type parameter (#3204)
fzoll Sep 27, 2025
e863bc1
dataset: add swedish cpc patent classifications to mteb (#3072)
Atheer2104 Sep 27, 2025
8c180d4
fix: AttributeError in ColPaliEngineWrapper similarity method (#3177)
FacerAin Sep 27, 2025
0aacba4
Update tasks & benchmarks tables
github-actions[bot] Sep 27, 2025
2e292cf
1.38.59
Sep 27, 2025
f58ac2b
fix: prevent EOS token truncation (#3218)
whybe-choi Sep 27, 2025
3e86531
1.38.60
Sep 27, 2025
15f9909
Update giga embeddings (#3210)
ekolodin Sep 29, 2025
cb03bd4
fix: Refactor split create_tables into static Benchmark methods (#3126)
q275343119 Sep 29, 2025
a52723a
1.38.61
Sep 29, 2025
4f58684
Extending the RTEB benchmark (#3223)
fzoll Sep 29, 2025
7f5990a
Update tasks & benchmarks tables
github-actions[bot] Sep 29, 2025
e299345
model: New qzmodel (#3211)
PennyYu123 Sep 30, 2025
0000ae2
model: Update Youtu embedding model (#3227)
spring-quan Sep 30, 2025
e56e7c4
dataset: Add Software Issue Localization Datasets (#3178)
tarsur909 Sep 30, 2025
65f29e6
Update tasks & benchmarks tables
github-actions[bot] Sep 30, 2025
11f9c1d
feat: Officially include RTEB in the leaderboard (#3222)
KennethEnevoldsen Oct 1, 2025
867105f
Update tasks & benchmarks tables
github-actions[bot] Oct 1, 2025
cf26684
1.39.0
Oct 1, 2025
600c290
fix: Add submission references for RTEB (#3233)
KennethEnevoldsen Oct 1, 2025
12fe80b
1.39.1
Oct 1, 2025
48a01fc
dataset: add human tasks and benchmark (#3214)
Samoed Oct 2, 2025
9a606a0
Update tasks & benchmarks tables
github-actions[bot] Oct 2, 2025
e419b54
Remove 'HUME(v1)' from leaderboard benchmark (#3236)
Samoed Oct 2, 2025
50aa4ac
docs: Update adding benchmark documentation (#3229)
Samoed Oct 2, 2025
a2f7488
fix: Further specified macro-language code for Norwegian (#3228)
KennethEnevoldsen Oct 2, 2025
810ae28
Update tasks & benchmarks tables
github-actions[bot] Oct 2, 2025
9249630
1.39.2
Oct 2, 2025
2f6eb2a
fix max tokens (#3243)
Muennighoff Oct 2, 2025
c21c20f
Merge branch 'main' into merge_main_05_10
Samoed Oct 3, 2025
3cea9e4
fix models
Samoed Oct 3, 2025
8902461
fix imports
Samoed Oct 3, 2025
3b95bb5
fix task import
Samoed Oct 3, 2025
56b0e4b
reupload HUME tasks
Samoed Oct 3, 2025
9a7723f
reupload SWE tasks
Samoed Oct 3, 2025
34a3b90
add stats
Samoed Oct 4, 2025
85e1dd9
fix python39 transformers compatibility (#3254)
Samoed Oct 5, 2025
36901eb
Aggregate by subset for HUMEv1 (#3255)
isaac-chung Oct 5, 2025
89bec7d
Update tasks & benchmarks tables
github-actions[bot] Oct 5, 2025
08b98cd
Fix AbsTaskTextRegression task (#3257)
AlexeyVatolin Oct 5, 2025
53b1c29
Added Japanese to Retrieval (#3252)
q275343119 Oct 5, 2025
c8ae52c
Update tasks & benchmarks tables
github-actions[bot] Oct 5, 2025
237d8dc
fix bm25 on small datasets (#3261)
Samoed Oct 6, 2025
6e2766d
Merge branch 'main' into merge_main_05_10
Samoed Oct 6, 2025
33 changes: 33 additions & 0 deletions .github/ISSUE_TEMPLATE/eval_request.yaml
@@ -0,0 +1,33 @@
name: 📊 Evaluation Request
description: Create a request for a model to be evaluated in MTEB
title: "Evaluate model: {model_id}"
labels: ["evaluation request"]
body:
  - type: input
    attributes:
      label: Model link on Hugging Face
      description: Please provide a link to the model on Hugging Face. If the model is closed-source, please provide a link to the model provider or documentation.
    validations:
      required: true
  - type: textarea
    attributes:
      label: What do you want it to be evaluated on?
      description: Please specify the tasks or benchmarks you would like this model to be evaluated on.
    validations:
      required: true
  - type: dropdown
    id: contribute
    attributes:
      label: Are you interested in contributing to the evaluation of this model?
      description: By default MTEB maintainers will only handle evaluation on private subsets due to resource constraints. If you are interested in contributing to the evaluation, please let us know.
      options:
        - "Yes"
        - "No"
  - type: dropdown
    id: exists
    attributes:
      label: Does this model already exist in MTEB?
      description: If you are unsure, please check using the mteb model registry (e.g. using `mteb.get_model_meta("model_id")`).
      options:
        - "Yes"
        - "No"
40 changes: 0 additions & 40 deletions .github/workflows/welcome_new_comers.yml

This file was deleted.

7 changes: 4 additions & 3 deletions docs/adding_a_benchmark.md
@@ -2,6 +2,7 @@

The MTEB Leaderboard is available [here](https://huggingface.co/spaces/mteb/leaderboard) and we encourage additions of new benchmarks. To add a new benchmark:

1. Add your benchmark to [benchmark.py](../mteb/benchmarks/benchmarks.py) as a `Benchmark` object, and select the MTEB tasks that will be in the benchmark. If some of the tasks do not exist in MTEB, follow the "add a dataset" instructions to add them.
2. Open a PR at https://github.com/embeddings-benchmark/results with results of models on your benchmark.
3. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
1. Add your benchmark to [benchmark.py](../mteb/benchmarks/benchmarks/benchmarks.py) as a `Benchmark` object, and select the MTEB tasks that will be in the benchmark. If some of the tasks do not exist in MTEB, follow the ["add a dataset"](./adding_a_dataset.md) instructions to add them.
2. Add your benchmark to the most fitting section in [benchmark_selector.py](../mteb/leaderboard/benchmark_selector.py).
3. Open a PR at https://github.com/embeddings-benchmark/results with results of models on your benchmark.
4. When PRs are merged, your benchmark will be added to the leaderboard automatically after the next workflow trigger.
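
As a reference for step 1, a minimal sketch of a `Benchmark` definition might look like the following; the import path, field names, and task selection are assumptions, so check `benchmarks.py` and existing benchmarks for the exact signature and conventions.

import mteb
from mteb.benchmarks import Benchmark  # assumed import path; see benchmarks.py

# Hypothetical benchmark grouping two existing English retrieval tasks.
# Field names (name, tasks, description, reference) mirror existing
# Benchmark definitions but should be checked against the current class.
MY_BENCHMARK = Benchmark(
    name="MyBench(eng, v1)",
    tasks=mteb.get_tasks(tasks=["ArguAna", "SciFact"]),
    description="Hypothetical benchmark grouping two English retrieval tasks.",
    reference="https://example.org/mybench",  # placeholder URL
)
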
2 changes: 1 addition & 1 deletion docs/adding_a_dataset.md
@@ -21,7 +21,7 @@ To add a new task, you need to implement a new class that inherits from the `Abs
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer
from mteb.abstasks.TaskMetadata import TaskMetadata
from mteb.abstasks.task_metadata import TaskMetadata

class SciDocsReranking(AbsTaskReranking):
metadata = TaskMetadata(
18 changes: 10 additions & 8 deletions docs/benchmarks.md
@@ -10,23 +10,24 @@ The following table gives you an overview of the benchmarks in MTEB.
|------|------------------|---------|------------|---------|-----------|
| [BEIR](https://arxiv.org/abs/2104.08663) | BEIR | 15 | Retrieval: 15 | [Academic, Blog, Encyclopaedic, Financial, Government, Medical, News, Non-fiction, Programming, Reviews, Social, Web, Written] | eng,vie |
| [BEIR-NL](https://arxiv.org/abs/2412.08329) | BEIR-NL | 15 | Retrieval: 15 | [Academic, Encyclopaedic, Medical, Non-fiction, Web, Written] | nld |
| [BRIGHT](https://brightbenchmark.github.io/) | BRIGHT | 1 | Retrieval: 1 | [Non-fiction, Written] | eng |
| [BRIGHT(long)](https://brightbenchmark.github.io/) | BRIGHT(long) | 1 | Retrieval: 1 | [Non-fiction, Written] | eng |
| [BRIGHT](https://brightbenchmark.github.io/) | Reasoning Retrieval | 1 | Retrieval: 1 | [Non-fiction, Written] | eng |
| [BRIGHT (long)](https://brightbenchmark.github.io/) | BRIGHT (long) | 1 | Retrieval: 1 | [Non-fiction, Written] | eng |
| [BuiltBench(eng)](https://arxiv.org/abs/2411.12056) | BuiltBench(eng) | 4 | Clustering: 2, Retrieval: 1, Reranking: 1 | [Engineering, Written] | eng |
| [ChemTEB](https://arxiv.org/abs/2412.00532) | Chemical | 27 | BitextMining: 1, Classification: 17, Clustering: 2, PairClassification: 5, Retrieval: 2 | [Chemistry] | ces,deu,eng,fra,hin,jpn,kor,msa,nld,por,spa,tur,zho |
| [CoIR](https://github.com/CoIR-team/coir) | Code Information Retrieval | 10 | Retrieval: 10 | [Programming, Written] | c++,eng,go,java,javascript,php,python,ruby,sql |
| [CodeRAG](https://arxiv.org/abs/2406.14497) | CodeRAG | 4 | Reranking: 4 | [Programming] | python |
| [Encodechka](https://github.com/avidale/encodechka) | Encodechka | 7 | STS: 2, Classification: 4, PairClassification: 1 | [Fiction, Government, News, Non-fiction, Social, Web, Written] | rus |
| [FollowIR](https://arxiv.org/abs/2403.15246) | Instruction Following | 3 | InstructionRetrieval: 3 | [News, Written] | eng |
| [HUME(v1)](Coming soon (in review)) | Human Benchmark | 16 | Classification: 4, Clustering: 4, Reranking: 4, STS: 4 | [Academic, Blog, Encyclopaedic, News, Reviews, Social, Web, Written] | ara,dan,eng,nob,rus |
| [JinaVDR](https://arxiv.org/abs/2506.18902) | Jina Visual Document Retrieval | 43 | DocumentUnderstanding: 43 | [Academic, Engineering, Government, Legal, Medical, News, Social, Web] | ara,ben,deu,eng,fra,hin,hun,ind,ita,jpn,kor,mya,nld,por,rus,spa,tha,urd,vie,zho |
| [LongEmbed](https://arxiv.org/abs/2404.12096v2) | Long-context Retrieval | 6 | Retrieval: 6 | [Academic, Blog, Encyclopaedic, Fiction, Non-fiction, Spoken, Written] | eng |
| [MIEB(Img)](https://arxiv.org/abs/2504.10471) | Image only | 49 | Any2AnyRetrieval: 15, ImageClassification: 22, ImageClustering: 5, VisualSTS(eng): 5, VisualSTS(multi): 2 | [Blog, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | ara,cmn,deu,eng,fra,ita,kor,nld,pol,por,rus,spa,tur |
| [MIEB(Multilingual)](https://arxiv.org/abs/2504.10471) | Image-Text, Multilingual | 130 | ImageClassification: 22, ImageClustering: 5, ZeroShotClassification: 23, VisionCentricQA: 6, Compositionality: 7, VisualSTS(eng): 7, Any2AnyRetrieval: 45, DocumentUnderstanding: 10, Any2AnyMultilingualRetrieval: 3, VisualSTS(multi): 2 | [Academic, Blog, Constructed, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | ara,ben,bul,ces,cmn,dan,deu,ell,eng,est,fas,fil,fin,fra,heb,hin,hrv,hun,ind,ita,jpn,kor,mri,nld,nor,pol,por,quz,ron,rus,spa,swa,swe,tel,tha,tur,ukr,vie,zho |
| [MIEB(Multilingual)](https://arxiv.org/abs/2504.10471) | Image-Text, Multilingual | 130 | ImageClassification: 22, ImageClustering: 5, ZeroShotClassification: 23, VisionCentricQA: 6, Compositionality: 7, VisualSTS(eng): 7, Any2AnyRetrieval: 45, DocumentUnderstanding: 10, Any2AnyMultilingualRetrieval: 3, VisualSTS(multi): 2 | [Academic, Blog, Constructed, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | ara,ben,bul,ces,cmn,dan,deu,ell,eng,est,fas,fil,fin,fra,heb,hin,hrv,hun,ind,ita,jpn,kor,mri,nld,nno,nob,nor,pol,por,quz,ron,rus,spa,swa,swe,tel,tha,tur,ukr,vie,zho |
| [MIEB(eng)](https://arxiv.org/abs/2504.10471) | Image-Text, English | 125 | ImageClassification: 22, ImageClustering: 5, ZeroShotClassification: 23, VisionCentricQA: 6, Compositionality: 7, VisualSTS(eng): 7, Any2AnyRetrieval: 45, DocumentUnderstanding: 10 | [Academic, Blog, Constructed, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | eng |
| [MIEB(lite)](https://arxiv.org/abs/2504.10471) | Image-Text, Lite | 51 | ImageClassification: 8, ImageClustering: 2, ZeroShotClassification: 7, VisionCentricQA: 5, Compositionality: 6, VisualSTS(eng): 2, VisualSTS(multi): 2, Any2AnyRetrieval: 11, DocumentUnderstanding: 6, Any2AnyMultilingualRetrieval: 2 | [Academic, Blog, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | ara,ben,bul,ces,cmn,dan,deu,ell,eng,est,fas,fil,fin,fra,heb,hin,hrv,hun,ind,ita,jpn,kor,mri,nld,nor,pol,por,quz,ron,rus,spa,swa,swe,tel,tha,tur,ukr,vie,zho |
| [MIEB(lite)](https://arxiv.org/abs/2504.10471) | Image-Text, Lite | 51 | ImageClassification: 8, ImageClustering: 2, ZeroShotClassification: 7, VisionCentricQA: 5, Compositionality: 6, VisualSTS(eng): 2, VisualSTS(multi): 2, Any2AnyRetrieval: 11, DocumentUnderstanding: 6, Any2AnyMultilingualRetrieval: 2 | [Academic, Blog, Encyclopaedic, Medical, News, Non-fiction, Reviews, Scene, Social, Spoken, Web, Written] | ara,ben,bul,ces,cmn,dan,deu,ell,eng,est,fas,fil,fin,fra,heb,hin,hrv,hun,ind,ita,jpn,kor,mri,nld,nno,nob,nor,pol,por,quz,ron,rus,spa,swa,swe,tel,tha,tur,ukr,vie,zho |
| [MINERSBitextMining](https://arxiv.org/pdf/2406.07424) | MINERSBitextMining | 7 | BitextMining: 7 | [Reviews, Social, Written] | abs,ace,afr,amh,ang,ara,arq,arz,ast,awa,aze,ban,bbc,bel,ben,ber,bew,bhp,bjn,bos,bre,bug,bul,cat,cbk,ceb,ces,cha,cmn,cor,csb,cym,dan,deu,dsb,dtp,ell,eng,epo,est,eus,fao,fin,fra,fry,gla,gle,glg,gsw,hau,heb,hin,hrv,hsb,hun,hye,ibo,ido,ile,ina,ind,isl,ita,jav,jpn,kab,kat,kaz,khm,kor,kur,kzj,lat,lfn,lit,lvs,mad,mak,mal,mar,max,mhr,min,mkd,mon,mui,nds,nij,nld,nno,nob,nov,oci,orv,pam,pcm,pes,pms,pol,por,rej,ron,rus,slk,slv,spa,sqi,srp,sun,swe,swg,swh,tam,tat,tel,tgl,tha,tuk,tur,tzl,uig,ukr,urd,uzb,vie,war,wuu,xho,yid,yor,yue,zsm |
| MTEB(Code, v1) | Code | 12 | Retrieval: 12 | [Programming, Written] | c,c++,eng,go,java,javascript,php,python,ruby,rust,scala,shell,sql,swift,typescript |
| [MTEB(Europe, v1)](https://arxiv.org/abs/2502.13595) | European | 74 | BitextMining: 7, Classification: 21, Clustering: 8, Retrieval: 15, InstructionRetrieval: 3, MultilabelClassification: 2, PairClassification: 6, Reranking: 3, STS: 9 | [Academic, Blog, Constructed, Encyclopaedic, Fiction, Financial, Government, Legal, Medical, News, Non-fiction, Programming, Religious, Reviews, Social, Spoken, Subtitles, Web, Written] | bul,ces,dan,deu,ell,eng,est,eus,fao,fin,fra,gle,hrv,hun,isl,ita,lav,lit,mlt,nld,nno,nob,pol,por,rom,ron,slk,slv,spa,swe |
| [MTEB(Europe, v1)](https://arxiv.org/abs/2502.13595) | European | 74 | BitextMining: 7, Classification: 21, Clustering: 8, Retrieval: 15, InstructionRetrieval: 3, MultilabelClassification: 2, PairClassification: 6, Reranking: 3, STS: 9 | [Academic, Blog, Constructed, Encyclopaedic, Fiction, Financial, Government, Legal, News, Non-fiction, Programming, Religious, Reviews, Social, Spoken, Subtitles, Web, Written] | bul,ces,dan,deu,ell,eng,est,eus,fao,fin,fra,gle,hrv,hun,isl,ita,lav,lit,mlt,nld,nno,nob,pol,por,rom,ron,slk,slv,spa,swe |
| [MTEB(Indic, v1)](https://arxiv.org/abs/2502.13595) | Indic | 23 | BitextMining: 4, Clustering: 1, Classification: 13, PairClassification: 1, Retrieval: 2, Reranking: 1, STS: 1 | [Constructed, Encyclopaedic, Fiction, Government, Legal, News, Non-fiction, Religious, Reviews, Social, Spoken, Web, Written] | asm,awa,ben,bgc,bho,bod,boy,brx,doi,eng,gbm,gom,guj,hin,hne,kan,kas,mai,mal,mar,mni,mup,mwr,nep,npi,ory,pan,pus,raj,san,sat,snd,tam,tel,urd |
| MTEB(Law, v1) | Legal | 8 | Retrieval: 8 | [Legal, Written] | deu,eng,zho |
| MTEB(Medical, v1) | Medical | 12 | Retrieval: 9, Clustering: 2, Reranking: 1 | [Academic, Government, Medical, Non-fiction, Web, Written] | ara,cmn,eng,fra,kor,pol,rus,spa,vie,zho |
@@ -46,15 +47,16 @@ The following table gives you an overview of the benchmarks in MTEB.
| [MTEB(rus, v1)](https://aclanthology.org/2023.eacl-main.148/) | Russian | 23 | Classification: 9, Clustering: 3, MultilabelClassification: 2, PairClassification: 1, Reranking: 2, Retrieval: 3, STS: 3 | [Academic, Blog, Encyclopaedic, News, Reviews, Social, Spoken, Web, Written] | rus |
| [NanoBEIR](https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6) | NanoBEIR | 13 | Retrieval: 13 | [Academic, Encyclopaedic, Medical, News, Non-fiction, Social, Web, Written] | eng |
| [R2MED](https://r2med.github.io/) | Reasoning-driven medical retrieval | 8 | Retrieval: 8 | [Medical] | eng |
| [RAR-b](https://arxiv.org/abs/2404.06347) | Reasoning retrieval | 17 | Retrieval: 17 | [Encyclopaedic, Programming, Written] | eng |
| [RAR-b](https://arxiv.org/abs/2404.06347) | Reasoning as retrieval | 17 | Retrieval: 17 | [Encyclopaedic, Programming, Written] | eng |
| RTEB(Code, beta) | RTEB Code | 8 | Retrieval: 8 | [Programming, Written] | eng,go,javascript,jpn,python,sql |
| RTEB(Health, beta) | RTEB Healthcare | 4 | Retrieval: 4 | [Academic, Medical, Written] | deu,eng,fra,spa |
| RTEB(Law, beta) | RTEB Legal | 7 | Retrieval: 7 | [Legal, Written] | deu,eng,fra,jpn |
| RTEB(beta) | RTEB Retrieval Embedding Benchmark | 28 | Retrieval: 28 | [Academic, Encyclopaedic, Financial, Legal, Medical, Non-fiction, Programming, Written] | deu,eng,fra,go,javascript,jpn,python,spa,sql |
| RTEB(beta) | RTEB Multilingual | 29 | Retrieval: 29 | [Academic, Encyclopaedic, Financial, Legal, Medical, Non-fiction, Programming, Written] | ara,ben,deu,eng,fas,fin,fra,go,hin,ind,javascript,jpn,kor,python,rus,spa,sql,swa,tel,tha,yor,zho |
| RTEB(deu, beta) | RTEB German | 4 | Retrieval: 4 | [Legal, Medical, Non-fiction, Written] | deu |
| RTEB(eng, beta) | RTEB English | 20 | Retrieval: 20 | [Academic, Financial, Legal, Medical, Non-fiction, Programming, Written] | eng,fra,go,javascript,python,spa,sql |
| RTEB(fin, beta) | RTEB Finance | 7 | Retrieval: 7 | [Financial, Non-fiction, Written] | eng |
| RTEB(fr, beta) | RTEB French | 3 | Retrieval: 3 | [Academic, Encyclopaedic, Legal, Medical, Written] | eng,fra |
| RTEB(fra, beta) | RTEB French | 3 | Retrieval: 3 | [Academic, Encyclopaedic, Legal, Medical, Written] | eng,fra |
| RTEB(jpn, beta) | RTEB Japanese | 2 | Retrieval: 2 | [Legal, Programming, Written] | jpn |
| [RuSciBench](https://link.springer.com/article/10.1134/S1064562424602191) | RuSciBench | 9 | BitextMining: 1, Classification: 4, Retrieval: 2, Regression: 2 | [Academic, Non-fiction, Written] | eng,rus |
| [VN-MTEB (vie, v1)](https://arxiv.org/abs/2507.21500) | Vietnamese | 50 | Retrieval: 24, Classification: 12, PairClassification: 3, Clustering: 5, Reranking: 3, STS: 3 | [Academic, Blog, Encyclopaedic, Financial, Government, Medical, News, Non-fiction, Programming, Reviews, Social, Spoken, Web, Written] | vie |
| [ViDoRe(v1)](https://arxiv.org/abs/2407.01449) | ViDoRe(v1) | 10 | DocumentUnderstanding: 10 | [Academic] | eng |