Add e5-nl results on MTEB-NL #351

Merged
KennethEnevoldsen merged 1 commit into embeddings-benchmark:main from nikolay-banar:e5-nl
Dec 5, 2025

nikolay-banar (Contributor) commented Dec 5, 2025

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

github-actions bot commented Dec 5, 2025

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: clips/e5-base-trm-nl, clips/e5-large-trm-nl, clips/e5-small-trm-nl
Tasks: ArguAna-NL.v2, BelebeleRetrieval, CovidDisinformationNLMultiLabelClassification, DutchBookReviewSentimentClassification.v2, DutchColaClassification, DutchGovernmentBiasClassification, DutchNewsArticlesClassification, DutchNewsArticlesClusteringP2P, DutchNewsArticlesClusteringS2S, DutchNewsArticlesRetrieval, DutchSarcasticHeadlinesClassification, IconclassClassification, IconclassClusteringS2S, LegalQANLRetrieval, MassiveIntentClassification, MassiveScenarioClassification, MultiEURLEXMultilabelClassification, MultiHateClassification, NFCorpus-NL.v2, OpenTenderClassification, OpenTenderClusteringP2P, OpenTenderClusteringS2S, OpenTenderRetrieval, SCIDOCS-NL.v2, SIB200Classification, SIB200ClusteringS2S, SICK-NL-STS, SICKNLPairClassification, STSBenchmarkMultilingualSTS, SciFact-NL.v2, VABBClusteringP2P, VABBClusteringS2S, VABBMultiLabelClassification, VABBRetrieval, VaccinChatNLClassification, WebFAQRetrieval, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, XLWICNLPairClassification, bBSARDNLRetrieval

Results for clips/e5-base-trm-nl

task_name clips/e5-base-trm-nl google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-NL.v2 0.4634 nan 0.4894 0.5603
BelebeleRetrieval 0.9380 0.9073 0.7791 0.9167
CovidDisinformationNLMultiLabelClassification 0.4873 nan 0.4970 0.5361
DutchBookReviewSentimentClassification.v2 0.6735 nan 0.6256 0.9228
DutchColaClassification 0.5611 nan 0.5676 0.5684
DutchGovernmentBiasClassification 0.6094 nan 0.6193 0.6193
DutchNewsArticlesClassification 0.5679 nan 0.5781 0.6236
DutchNewsArticlesClusteringP2P 0.3940 nan 0.4045 0.4742
DutchNewsArticlesClusteringS2S 0.2833 nan 0.2601 0.3471
DutchNewsArticlesRetrieval 0.6652 nan 0.7459 0.8200
DutchSarcasticHeadlinesClassification 0.6636 nan 0.7281 0.7281
IconclassClassification 0.5399 nan 0.5134 0.5724
IconclassClusteringS2S 0.2550 nan 0.2220 0.3077
LegalQANLRetrieval 0.6725 nan 0.7748 0.8267
MassiveIntentClassification 0.6273 0.8192 0.6591 0.9194
MassiveScenarioClassification 0.6876 0.8730 0.7012 0.9930
MultiEURLEXMultilabelClassification 0.0519 0.0528 0.0516 0.0561
MultiHateClassification 0.5806 0.7247 0.6357 0.8374
NFCorpus-NL.v2 0.2808 nan 0.2982 0.3301
OpenTenderClassification 0.4420 nan 0.4193 0.5166
OpenTenderClusteringP2P 0.3442 nan 0.2301 0.5051
OpenTenderClusteringS2S 0.2743 nan 0.1617 0.4659
OpenTenderRetrieval 0.3925 nan 0.3778 0.4871
SCIDOCS-NL.v2 0.1429 nan 0.1309 0.1833
SIB200Classification 0.7452 nan 0.7339 0.7968
SIB200ClusteringS2S 0.4121 0.4174 0.3945 0.5067
SICK-NL-STS 0.7375 nan 0.7692 0.8855
SICKNLPairClassification 0.8999 nan 0.9332 0.9711
STSBenchmarkMultilingualSTS 0.8066 nan 0.8349 0.9554
SciFact-NL.v2 0.6683 nan 0.6840 0.6958
VABBClusteringP2P 0.4234 nan 0.3437 0.5769
VABBClusteringS2S 0.3532 nan 0.3071 0.4452
VABBMultiLabelClassification 0.5389 nan 0.5233 0.5611
VABBRetrieval 0.7285 nan 0.7036 0.8100
VaccinChatNLClassification 0.4959 nan 0.5063 0.5768
WebFAQRetrieval 0.7430 nan 0.8072 0.8571
WikipediaRerankingMultilingual 0.8738 0.9224 0.8981 0.9308
WikipediaRetrievalMultilingual 0.8906 0.9420 0.9111 0.9420
XLWICNLPairClassification 0.6676 nan 0.6732 0.6956
bBSARDNLRetrieval 0.1987 nan 0.2384 0.3128
Average 0.5445 0.7074 0.5433 0.6409

Model has high performance on these tasks: BelebeleRetrieval
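A note on reading the Average row: as a rough sketch, assuming the bot averages only over non-nan entries (an assumption, though it is consistent with the figures above), the google/gemini-embedding-001 average covers just the 8 tasks it has scores for, while the clips/e5-base-trm-nl average covers all 40. The helper name `nan_skipping_mean` is hypothetical and not part of mteb; the scores are copied from the table above.

```python
import math

# Scores copied from the table above for the 8 tasks where
# google/gemini-embedding-001 has a value; its other 32 entries are nan.
gemini_scores = {
    "BelebeleRetrieval": 0.9073,
    "MassiveIntentClassification": 0.8192,
    "MassiveScenarioClassification": 0.8730,
    "MultiEURLEXMultilabelClassification": 0.0528,
    "MultiHateClassification": 0.7247,
    "SIB200ClusteringS2S": 0.4174,
    "WikipediaRerankingMultilingual": 0.9224,
    "WikipediaRetrievalMultilingual": 0.9420,
}

# clips/e5-base-trm-nl scores on the same 8 tasks, also from the table above.
base_scores = {
    "BelebeleRetrieval": 0.9380,
    "MassiveIntentClassification": 0.6273,
    "MassiveScenarioClassification": 0.6876,
    "MultiEURLEXMultilabelClassification": 0.0519,
    "MultiHateClassification": 0.5806,
    "SIB200ClusteringS2S": 0.4121,
    "WikipediaRerankingMultilingual": 0.8738,
    "WikipediaRetrievalMultilingual": 0.8906,
}


def nan_skipping_mean(values):
    """Mean over the entries that are not nan (hypothetical helper)."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present)


# Rebuild the full 40-task gemini column: 8 scores plus 32 nan entries.
gemini_column = list(gemini_scores.values()) + [float("nan")] * 32

gemini_average = nan_skipping_mean(gemini_column)
shared_base_average = nan_skipping_mean(base_scores.values())

print(round(gemini_average, 3), round(shared_base_average, 3))
```

Restricted to those 8 shared tasks, clips/e5-base-trm-nl averages about 0.633 against about 0.707 for google/gemini-embedding-001, so the 0.5445 vs 0.7074 gap in the Average row overstates the difference between the two models.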


Results for clips/e5-large-trm-nl

task_name clips/e5-large-trm-nl google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-NL.v2 0.4713 nan 0.4894 0.5603
BelebeleRetrieval 0.9305 0.9073 0.7791 0.9167
CovidDisinformationNLMultiLabelClassification 0.5159 nan 0.4970 0.5361
DutchBookReviewSentimentClassification.v2 0.7297 nan 0.6256 0.9228
DutchColaClassification 0.5592 nan 0.5676 0.5684
DutchGovernmentBiasClassification 0.6102 nan 0.6193 0.6193
DutchNewsArticlesClassification 0.5819 nan 0.5781 0.6236
DutchNewsArticlesClusteringP2P 0.4073 nan 0.4045 0.4742
DutchNewsArticlesClusteringS2S 0.3043 nan 0.2601 0.3471
DutchNewsArticlesRetrieval 0.7104 nan 0.7459 0.8200
DutchSarcasticHeadlinesClassification 0.7396 nan 0.7281 0.7281
IconclassClassification 0.5314 nan 0.5134 0.5724
IconclassClusteringS2S 0.2492 nan 0.2220 0.3077
LegalQANLRetrieval 0.7156 nan 0.7748 0.8267
MassiveIntentClassification 0.6510 0.8192 0.6591 0.9194
MassiveScenarioClassification 0.7081 0.8730 0.7012 0.9930
MultiEURLEXMultilabelClassification 0.0650 0.0528 0.0516 0.0561
MultiHateClassification 0.6520 0.7247 0.6357 0.8374
NFCorpus-NL.v2 0.3080 nan 0.2982 0.3301
OpenTenderClassification 0.4713 nan 0.4193 0.5166
OpenTenderClusteringP2P 0.3681 nan 0.2301 0.5051
OpenTenderClusteringS2S 0.2925 nan 0.1617 0.4659
OpenTenderRetrieval 0.4250 nan 0.3778 0.4871
SCIDOCS-NL.v2 0.1593 nan 0.1309 0.1833
SIB200Classification 0.7517 nan 0.7339 0.7968
SIB200ClusteringS2S 0.4362 0.4174 0.3945 0.5067
SICK-NL-STS 0.7576 nan 0.7692 0.8855
SICKNLPairClassification 0.9530 nan 0.9332 0.9711
STSBenchmarkMultilingualSTS 0.8280 nan 0.8349 0.9554
SciFact-NL.v2 0.6391 nan 0.6840 0.6958
VABBClusteringP2P 0.4564 nan 0.3437 0.5769
VABBClusteringS2S 0.3502 nan 0.3071 0.4452
VABBMultiLabelClassification 0.5551 nan 0.5233 0.5611
VABBRetrieval 0.7622 nan 0.7036 0.8100
VaccinChatNLClassification 0.5141 nan 0.5063 0.5768
WebFAQRetrieval 0.7425 nan 0.8072 0.8571
WikipediaRerankingMultilingual 0.8718 0.9224 0.8981 0.9308
WikipediaRetrievalMultilingual 0.8883 0.9420 0.9111 0.9420
XLWICNLPairClassification 0.6754 nan 0.6732 0.6956
bBSARDNLRetrieval 0.2268 nan 0.2384 0.3128
Average 0.5641 0.7074 0.5433 0.6409

Model has high performance on these tasks: BelebeleRetrieval, DutchSarcasticHeadlinesClassification, MultiEURLEXMultilabelClassification
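The flagged tasks are consistent with a simple rule: a task is listed when the new model's score is higher than the Max result column. The bot's actual criterion is not shown in this thread, so the following is only a sketch under that assumption, using a hypothetical helper `flag_above_max` and five rows copied from the clips/e5-large-trm-nl table above.

```python
# (task, new model score, "Max result") rows copied from the
# clips/e5-large-trm-nl table above; only a sample of the 40 tasks.
rows = [
    ("BelebeleRetrieval", 0.9305, 0.9167),
    ("CovidDisinformationNLMultiLabelClassification", 0.5159, 0.5361),
    ("DutchSarcasticHeadlinesClassification", 0.7396, 0.7281),
    ("MultiEURLEXMultilabelClassification", 0.0650, 0.0561),
    ("SICKNLPairClassification", 0.9530, 0.9711),
]


def flag_above_max(rows):
    """Return tasks whose new score exceeds the previous maximum.

    Assumed rule, not the bot's confirmed implementation.
    """
    return [task for task, new_score, max_result in rows if new_score > max_result]


flagged = flag_above_max(rows)
print(flagged)
# ['BelebeleRetrieval', 'DutchSarcasticHeadlinesClassification',
#  'MultiEURLEXMultilabelClassification']
```

A score above the previous maximum is not wrong in itself, but it is a cheap signal that a result deserves a second look, for example for training-data leakage or an aggregation problem.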


Results for clips/e5-small-trm-nl

task_name clips/e5-small-trm-nl google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-NL.v2 0.4628 nan 0.4894 0.5603
BelebeleRetrieval 0.9244 0.9073 0.7791 0.9167
CovidDisinformationNLMultiLabelClassification 0.4916 nan 0.4970 0.5361
DutchBookReviewSentimentClassification.v2 0.6255 nan 0.6256 0.9228
DutchColaClassification 0.5495 nan 0.5676 0.5684
DutchGovernmentBiasClassification 0.6146 nan 0.6193 0.6193
DutchNewsArticlesClassification 0.5749 nan 0.5781 0.6236
DutchNewsArticlesClusteringP2P 0.4146 nan 0.4045 0.4742
DutchNewsArticlesClusteringS2S 0.2647 nan 0.2601 0.3471
DutchNewsArticlesRetrieval 0.6664 nan 0.7459 0.8200
DutchSarcasticHeadlinesClassification 0.6645 nan 0.7281 0.7281
IconclassClassification 0.5182 nan 0.5134 0.5724
IconclassClusteringS2S 0.2257 nan 0.2220 0.3077
LegalQANLRetrieval 0.7118 nan 0.7748 0.8267
MassiveIntentClassification 0.5980 0.8192 0.6591 0.9194
MassiveScenarioClassification 0.6719 0.8730 0.7012 0.9930
MultiEURLEXMultilabelClassification 0.0504 0.0528 0.0516 0.0561
MultiHateClassification 0.5731 0.7247 0.6357 0.8374
NFCorpus-NL.v2 0.2918 nan 0.2982 0.3301
OpenTenderClassification 0.4347 nan 0.4193 0.5166
OpenTenderClusteringP2P 0.3234 nan 0.2301 0.5051
OpenTenderClusteringS2S 0.2320 nan 0.1617 0.4659
OpenTenderRetrieval 0.4154 nan 0.3778 0.4871
SCIDOCS-NL.v2 0.1345 nan 0.1309 0.1833
SIB200Classification 0.7282 nan 0.7339 0.7968
SIB200ClusteringS2S 0.3887 0.4174 0.3945 0.5067
SICK-NL-STS 0.7223 nan 0.7692 0.8855
SICKNLPairClassification 0.8805 nan 0.9332 0.9711
STSBenchmarkMultilingualSTS 0.7948 nan 0.8349 0.9554
SciFact-NL.v2 0.6457 nan 0.6840 0.6958
VABBClusteringP2P 0.4266 nan 0.3437 0.5769
VABBClusteringS2S 0.3470 nan 0.3071 0.4452
VABBMultiLabelClassification 0.5216 nan 0.5233 0.5611
VABBRetrieval 0.7312 nan 0.7036 0.8100
VaccinChatNLClassification 0.4518 nan 0.5063 0.5768
WebFAQRetrieval 0.7247 nan 0.8072 0.8571
WikipediaRerankingMultilingual 0.8709 0.9224 0.8981 0.9308
WikipediaRetrievalMultilingual 0.8869 0.9420 0.9111 0.9420
XLWICNLPairClassification 0.6397 nan 0.6732 0.6956
bBSARDNLRetrieval 0.1269 nan 0.2384 0.3128
Average 0.5330 0.7074 0.5433 0.6409

Model has high performance on these tasks: BelebeleRetrieval


KennethEnevoldsen (Contributor) commented

Looks good maybe with the exception of BelebeleRetrieval. @nikolay-banar is there a chance that it might have been leaked to the training data?

nikolay-banar (Contributor, Author) commented Dec 5, 2025

> Looks good maybe with the exception of BelebeleRetrieval. @nikolay-banar is there a chance that it might have been leaked to the training data?

BelebeleRetrieval has some issues with the result aggregation: #317 (comment)

The results for the nld subset are high for all models.

KennethEnevoldsen (Contributor) commented

Ahh @Samoed should we make an issue on this? (sounds like you know where to look)

Samoed (Member) commented Dec 5, 2025

Currently, I don't know, but yes we should create an issue

@KennethEnevoldsen KennethEnevoldsen merged commit 2930358 into embeddings-benchmark:main Dec 5, 2025
4 checks passed