Add results for v2 datasets for MTEB-NL #317

Merged
Samoed merged 1 commit into embeddings-benchmark:main from nikolay-banar:mteb-nl-upd on Nov 13, 2025

@nikolay-banar
Contributor

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The submitted results were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Alibaba-NLP/gte-multilingual-base, BAAI/bge-m3, Snowflake/snowflake-arctic-embed-l-v2.0, Snowflake/snowflake-arctic-embed-m-v2.0, ibm-granite/granite-embedding-107m-multilingual, ibm-granite/granite-embedding-278m-multilingual, intfloat/multilingual-e5-base, intfloat/multilingual-e5-large, intfloat/multilingual-e5-small, minishlab/potion-multilingual-128M, sentence-transformers/LaBSE, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, sentence-transformers/static-similarity-mrl-multilingual-v1
Tasks: ArguAna-NL, ArguAna-NL.v2, BelebeleRetrieval, CovidDisinformationNLMultiLabelClassification, DutchBookReviewSentimentClassification, DutchBookReviewSentimentClassification.v2, DutchColaClassification, DutchGovernmentBiasClassification, DutchNewsArticlesClassification, DutchNewsArticlesClusteringP2P, DutchNewsArticlesClusteringS2S, DutchNewsArticlesRetrieval, DutchSarcasticHeadlinesClassification, IconclassClassification, IconclassClusteringS2S, LegalQANLRetrieval, MassiveIntentClassification, MassiveScenarioClassification, MultiEURLEXMultilabelClassification, MultiHateClassification, NFCorpus-NL, NFCorpus-NL.v2, OpenTenderClassification, OpenTenderClusteringP2P, OpenTenderClusteringS2S, OpenTenderRetrieval, SCIDOCS-NL, SCIDOCS-NL.v2, SIB200Classification, SIB200ClusteringS2S, SICK-NL-STS, SICKNLPairClassification, STSBenchmarkMultilingualSTS, SciFact-NL, SciFact-NL.v2, VABBClusteringP2P, VABBClusteringS2S, VABBMultiLabelClassification, VABBRetrieval, VaccinChatNLClassification, WebFAQRetrieval, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, XLWICNLPairClassification, bBSARDNLRetrieval

Results for Alibaba-NLP/gte-multilingual-base

| task_name | Alibaba-NLP/gte-multilingual-base | Alibaba-NLP/gte-multilingual-base | Alibaba-NLP/gte-multilingual-base | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|---|---|
| Revisions | 7fc06782350c1a83f88b15dd4b38ef853d3b8503 | ca1791e0bcc104f6db161f27de1340241b13c5a4 | external | | | |
| ArguAna-NL.v2 | nan | 0.5283 | nan | nan | 0.4894 | |
| BelebeleRetrieval | 0.7760 | 0.8920 | nan | 0.9073 | 0.7791 | 0.9167 |
| CovidDisinformationNLMultiLabelClassification | nan | 0.4972 | nan | nan | 0.4970 | |
| DutchBookReviewSentimentClassification | nan | 0.7695 | nan | nan | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | nan | 0.7459 | nan | nan | 0.6256 | |
| DutchColaClassification | nan | 0.5550 | nan | nan | 0.5676 | |
| DutchGovernmentBiasClassification | nan | 0.6193 | nan | nan | 0.6193 | |
| DutchNewsArticlesClassification | nan | 0.5300 | nan | nan | 0.5781 | |
| DutchNewsArticlesClusteringP2P | nan | 0.3507 | nan | nan | 0.4045 | |
| DutchNewsArticlesClusteringS2S | nan | 0.2790 | nan | nan | 0.2601 | |
| DutchNewsArticlesRetrieval | nan | 0.8088 | nan | nan | 0.7459 | |
| DutchSarcasticHeadlinesClassification | nan | 0.6578 | nan | nan | 0.7281 | |
| IconclassClassification | nan | 0.5390 | nan | nan | 0.5134 | |
| IconclassClusteringS2S | nan | 0.2595 | nan | nan | 0.2220 | |
| LegalQANLRetrieval | nan | 0.6490 | nan | nan | 0.7748 | |
| MassiveIntentClassification | 0.5680 | 0.6193 | 0.6071 | 0.8192 | 0.6591 | 0.9194 |
| MassiveScenarioClassification | 0.6807 | 0.6985 | 0.6638 | 0.8730 | 0.7012 | 0.9930 |
| MultiEURLEXMultilabelClassification | 0.0094 | 0.0072 | nan | 0.0528 | 0.0516 | 0.0561 |
| MultiHateClassification | 0.6078 | 0.5824 | nan | 0.7247 | 0.6357 | 0.8374 |
| NFCorpus-NL.v2 | nan | 0.2947 | nan | nan | 0.2982 | |
| OpenTenderClassification | nan | 0.4336 | nan | nan | 0.4193 | |
| OpenTenderClusteringP2P | nan | 0.3364 | nan | nan | 0.2301 | |
| OpenTenderClusteringS2S | nan | 0.2761 | nan | nan | 0.1617 | |
| OpenTenderRetrieval | nan | 0.4219 | nan | nan | 0.3778 | |
| SCIDOCS-NL.v2 | nan | 0.1584 | nan | nan | 0.1309 | |
| SIB200Classification | nan | 0.6725 | nan | nan | 0.7339 | 0.7600 |
| SIB200ClusteringS2S | 0.2565 | 0.2411 | nan | 0.4174 | 0.3945 | 0.5067 |
| SICK-NL-STS | nan | 0.7582 | nan | nan | 0.7692 | |
| SICKNLPairClassification | nan | 0.9294 | nan | nan | 0.9332 | |
| STSBenchmarkMultilingualSTS | 0.8432 | 0.8302 | 0.8443 | nan | 0.8349 | 0.9554 |
| SciFact-NL.v2 | nan | 0.6433 | nan | nan | 0.6840 | |
| VABBClusteringP2P | nan | 0.4095 | nan | nan | 0.3437 | |
| VABBClusteringS2S | nan | 0.3520 | nan | nan | 0.3071 | |
| VABBMultiLabelClassification | nan | 0.5212 | nan | nan | 0.5233 | |
| VABBRetrieval | nan | 0.7367 | nan | nan | 0.7036 | |
| VaccinChatNLClassification | nan | 0.4881 | nan | nan | 0.5063 | |
| WikipediaRerankingMultilingual | 0.8263 | 0.8237 | nan | 0.9224 | 0.8970 | 0.9224 |
| WikipediaRetrievalMultilingual | 0.8369 | 0.8400 | nan | 0.9420 | 0.9082 | 0.9420 |
| XLWICNLPairClassification | nan | 0.6274 | nan | nan | 0.6732 | |
| bBSARDNLRetrieval | nan | 0.2009 | nan | nan | 0.2384 | |
| Average | 0.6005 | 0.5396 | 0.7051 | 0.7074 | 0.5425 | 0.7901 |
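
Note on reading the Average row: it appears to be a per-column mean over only the tasks that have a value in that column, so two columns that cover different task sets (for example the two gte-multilingual-base revisions) are not directly comparable. The minimal Python sketch below reproduces the 0.6005 average of the first revision column from its nine non-nan scores. The helper name `nan_skipping_mean` is hypothetical, and this is only an illustration, not the repository's table generation code.

```python
import math

nan = float("nan")

# Scores copied from the first gte-multilingual-base column (revision 7fc0...).
# One nan row is kept as a representative of the tasks without a result;
# the remaining nan rows are omitted for brevity.
old_revision_scores = {
    "ArguAna-NL.v2": nan,
    "BelebeleRetrieval": 0.7760,
    "MassiveIntentClassification": 0.5680,
    "MassiveScenarioClassification": 0.6807,
    "MultiEURLEXMultilabelClassification": 0.0094,
    "MultiHateClassification": 0.6078,
    "SIB200ClusteringS2S": 0.2565,
    "STSBenchmarkMultilingualSTS": 0.8432,
    "WikipediaRerankingMultilingual": 0.8263,
    "WikipediaRetrievalMultilingual": 0.8369,
}


def nan_skipping_mean(scores):
    """Return (mean, count) over the tasks that actually have a score."""
    present = [s for s in scores.values() if not math.isnan(s)]
    return sum(present) / len(present), len(present)


mean, n_tasks = nan_skipping_mean(old_revision_scores)
print(f"{mean:.4f} over {n_tasks} tasks")  # prints "0.6005 over 9 tasks"
```

Because the second revision column averages over 40 tasks rather than 9, its lower average (0.5396) does not by itself indicate a regression.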

Results for BAAI/bge-m3

| task_name | BAAI/bge-m3 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.5213 | 0.4894 | |
| CovidDisinformationNLMultiLabelClassification | 0.5097 | 0.4970 | |
| DutchBookReviewSentimentClassification | 0.7790 | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | 0.7624 | 0.6256 | |
| DutchColaClassification | 0.5618 | 0.5676 | |
| DutchGovernmentBiasClassification | 0.6136 | 0.6193 | |
| DutchNewsArticlesClassification | 0.5564 | 0.5781 | |
| DutchNewsArticlesClusteringP2P | 0.3928 | 0.4045 | |
| DutchNewsArticlesClusteringS2S | 0.2027 | 0.2601 | |
| DutchNewsArticlesRetrieval | 0.8167 | 0.7459 | |
| DutchSarcasticHeadlinesClassification | 0.6534 | 0.7281 | |
| IconclassClassification | 0.5128 | 0.5134 | |
| IconclassClusteringS2S | 0.2332 | 0.2220 | |
| LegalQANLRetrieval | 0.8123 | 0.7748 | |
| NFCorpus-NL.v2 | 0.2904 | 0.2982 | |
| OpenTenderClassification | 0.4223 | 0.4193 | |
| OpenTenderClusteringP2P | 0.2586 | 0.2301 | |
| OpenTenderClusteringS2S | 0.2148 | 0.1617 | |
| OpenTenderRetrieval | 0.3904 | 0.3778 | |
| SCIDOCS-NL.v2 | 0.1484 | 0.1309 | |
| SICK-NL-STS | 0.7634 | 0.7692 | |
| SICKNLPairClassification | 0.9224 | 0.9332 | |
| SciFact-NL.v2 | 0.6287 | 0.6840 | |
| VABBClusteringP2P | 0.3641 | 0.3437 | |
| VABBClusteringS2S | 0.2878 | 0.3071 | |
| VABBMultiLabelClassification | 0.5150 | 0.5233 | |
| VABBRetrieval | 0.7496 | 0.7036 | |
| VaccinChatNLClassification | 0.5242 | 0.5063 | |
| XLWICNLPairClassification | 0.6435 | 0.6732 | |
| bBSARDNLRetrieval | 0.2407 | 0.2384 | |
| Average | 0.5097 | 0.5034 | 0.8821 |

Results for Snowflake/snowflake-arctic-embed-l-v2.0

| task_name | Snowflake/snowflake-arctic-embed-l-v2.0 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.5312 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6186 | 0.6256 | |
| NFCorpus-NL.v2 | 0.3006 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1783 | 0.1309 | |
| SciFact-NL.v2 | 0.6920 | 0.6840 | |
| Average | 0.4642 | 0.4457 | |

Results for Snowflake/snowflake-arctic-embed-m-v2.0

| task_name | Snowflake/snowflake-arctic-embed-m-v2.0 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL | 0.4708 | 0.4894 | 0.7396 |
| ArguAna-NL.v2 | 0.4711 | 0.4894 | |
| CovidDisinformationNLMultiLabelClassification | 0.4978 | 0.4970 | |
| DutchBookReviewSentimentClassification | 0.5659 | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | 0.5708 | 0.6256 | |
| DutchColaClassification | 0.5359 | 0.5676 | |
| DutchGovernmentBiasClassification | 0.6119 | 0.6193 | |
| DutchNewsArticlesClassification | 0.5246 | 0.5781 | |
| DutchNewsArticlesClusteringP2P | 0.3435 | 0.4045 | |
| DutchNewsArticlesClusteringS2S | 0.1955 | 0.2601 | |
| DutchNewsArticlesRetrieval | 0.6905 | 0.7459 | |
| DutchSarcasticHeadlinesClassification | 0.6141 | 0.7281 | |
| IconclassClassification | 0.5157 | 0.5134 | |
| IconclassClusteringS2S | 0.2387 | 0.2220 | |
| LegalQANLRetrieval | 0.7416 | 0.7748 | |
| NFCorpus-NL | 0.2765 | 0.2982 | 0.3700 |
| NFCorpus-NL.v2 | 0.2764 | 0.2982 | |
| OpenTenderClassification | 0.3724 | 0.4193 | |
| OpenTenderClusteringP2P | 0.1802 | 0.2301 | |
| OpenTenderClusteringS2S | 0.1602 | 0.1617 | |
| OpenTenderRetrieval | 0.4476 | 0.3778 | |
| SCIDOCS-NL | 0.1538 | 0.1309 | 0.2477 |
| SCIDOCS-NL.v2 | 0.1538 | 0.1309 | |
| SIB200Classification | 0.6684 | 0.7339 | 0.7600 |
| SICK-NL-STS | 0.6373 | 0.7692 | |
| SICKNLPairClassification | 0.7312 | 0.9332 | |
| SciFact-NL | 0.6789 | 0.6840 | 0.8023 |
| SciFact-NL.v2 | 0.6799 | 0.6840 | |
| VABBClusteringP2P | 0.3785 | 0.3437 | |
| VABBClusteringS2S | 0.2918 | 0.3071 | |
| VABBMultiLabelClassification | 0.5013 | 0.5233 | |
| VABBRetrieval | 0.7394 | 0.7036 | |
| VaccinChatNLClassification | 0.4256 | 0.5063 | |
| WebFAQRetrieval | 0.7132 | 0.8072 | 0.8571 |
| XLWICNLPairClassification | 0.6019 | 0.6732 | |
| bBSARDNLRetrieval | 0.1635 | 0.2384 | |
| Average | 0.4672 | 0.5069 | 0.6655 |

Results for ibm-granite/granite-embedding-107m-multilingual

| task_name | ibm-granite/granite-embedding-107m-multilingual | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4548 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.5534 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2354 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1394 | 0.1309 | |
| SciFact-NL.v2 | 0.5888 | 0.6840 | |
| Average | 0.3944 | 0.4457 | |

Results for ibm-granite/granite-embedding-278m-multilingual

| task_name | ibm-granite/granite-embedding-278m-multilingual | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4823 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.5569 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2427 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1430 | 0.1309 | |
| SciFact-NL.v2 | 0.6019 | 0.6840 | |
| Average | 0.4054 | 0.4457 | |

Results for intfloat/multilingual-e5-base

| task_name | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4510 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6447 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2711 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1237 | 0.1309 | |
| SciFact-NL.v2 | 0.6676 | 0.6840 | |
| Average | 0.4316 | 0.4457 | |

Results for intfloat/multilingual-e5-large

| task_name | intfloat/multilingual-e5-large | Max result |
|---|---|---|
| ArguAna-NL.v2 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1309 | |
| SciFact-NL.v2 | 0.6840 | |
| Average | 0.4457 | |

Results for intfloat/multilingual-e5-small

| task_name | intfloat/multilingual-e5-large | intfloat/multilingual-e5-small | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3989 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.6021 | |
| NFCorpus-NL.v2 | 0.2982 | 0.2797 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.089 | |
| SciFact-NL.v2 | 0.6840 | 0.6093 | |
| Average | 0.4457 | 0.3958 | |

Results for minishlab/potion-multilingual-128M

| task_name | intfloat/multilingual-e5-large | minishlab/potion-multilingual-128M | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3669 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.562 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1641 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0696 | |
| SciFact-NL.v2 | 0.6840 | 0.4141 | |
| Average | 0.4457 | 0.3153 | |

Results for sentence-transformers/LaBSE

| task_name | intfloat/multilingual-e5-large | sentence-transformers/LaBSE | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3924 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.6057 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1549 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0629 | |
| SciFact-NL.v2 | 0.6840 | 0.3896 | |
| Average | 0.4457 | 0.3211 | |

Results for sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

| task_name | intfloat/multilingual-e5-large | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3291 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5776 | |
| NFCorpus-NL.v2 | 0.2982 | 0.163 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0958 | |
| SciFact-NL.v2 | 0.6840 | 0.3382 | |
| Average | 0.4457 | 0.3007 | |

Results for sentence-transformers/paraphrase-multilingual-mpnet-base-v2

| task_name | intfloat/multilingual-e5-large | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3434 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5975 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1855 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.1118 | |
| SciFact-NL.v2 | 0.6840 | 0.4223 | |
| Average | 0.4457 | 0.3321 | |

Results for sentence-transformers/static-similarity-mrl-multilingual-v1

| task_name | intfloat/multilingual-e5-large | sentence-transformers/static-similarity-mrl-multilingual-v1 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3569 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5951 | |
| NFCorpus-NL.v2 | 0.2982 | 0.193 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0773 | |
| SciFact-NL.v2 | 0.6840 | 0.4234 | |
| Average | 0.4457 | 0.3292 | |

@Samoed
Member

Samoed commented Nov 12, 2025

Interesting difference for Alibaba-NLP/gte-multilingual-base. The old results were added in 7a67ef5, before we had its implementation in our repo (that was added later in embeddings-benchmark/mteb#1811), but only the README changed between these two revisions.
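
To put a number on the difference, the short Python sketch below ranks the tasks for which both gte-multilingual-base revisions have a score in the comparison table by the absolute change between them. The values are copied from that table; the variable names are illustrative only.

```python
# Tasks where both revisions have a score, copied from the comparison table
# as (revision 7fc0..., revision ca17...).
paired_scores = {
    "BelebeleRetrieval": (0.7760, 0.8920),
    "MassiveIntentClassification": (0.5680, 0.6193),
    "MassiveScenarioClassification": (0.6807, 0.6985),
    "MultiEURLEXMultilabelClassification": (0.0094, 0.0072),
    "MultiHateClassification": (0.6078, 0.5824),
    "SIB200ClusteringS2S": (0.2565, 0.2411),
    "STSBenchmarkMultilingualSTS": (0.8432, 0.8302),
    "WikipediaRerankingMultilingual": (0.8263, 0.8237),
    "WikipediaRetrievalMultilingual": (0.8369, 0.8400),
}

# Signed change from the old to the new revision, largest magnitude first.
diffs = {task: new - old for task, (old, new) in paired_scores.items()}
ranked = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)

for task, delta in ranked:
    print(f"{task}: {delta:+.4f}")
```

BelebeleRetrieval comes out first with a change of about +0.116, followed by MassiveIntentClassification at about +0.051; the remaining tasks differ by less than 0.03.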

@nikolay-banar
Contributor Author

> Interesting difference for Alibaba-NLP/gte-multilingual-base. The old results were added in 7a67ef5, before we had its implementation in our repo (that was added later in embeddings-benchmark/mteb#1811), but only the README changed between these two revisions.

There are some fluctuations in the results, but the biggest one is for BelebeleRetrieval: 0.7760 (7fc06782350c1a83f88b15dd4b38ef853d3b8503) vs 0.8920 (ca1791e0bcc104f6db161f27de1340241b13c5a4)

However, when I check the old revision's file directly, it contains the same score as the new revision (0.8920 after rounding):

results/Alibaba-NLP__gte-multilingual-base/7fc06782350c1a83f88b15dd4b38ef853d3b8503/BelebeleRetrieval.json

    "hf_subset": "nld_Latn-nld_Latn",
    "languages": [
      "nld-Latn",
      "nld-Latn"
    ],
    "main_score": 0.89204,
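
A check like this can be scripted. The Python sketch below parses a trimmed stand-in for the result file and lists the main score per subset. Only `hf_subset`, `languages` and `main_score` are taken from the excerpt above; the surrounding `"scores"` → `"test"` nesting and the helper name `main_scores_by_subset` are assumptions made for illustration.

```python
import json

# Trimmed stand-in for BelebeleRetrieval.json. The three inner fields come
# from the excerpt above; the "scores" -> "test" nesting is an assumption
# about the file layout.
raw = """
{
  "scores": {
    "test": [
      {
        "hf_subset": "nld_Latn-nld_Latn",
        "languages": ["nld-Latn", "nld-Latn"],
        "main_score": 0.89204
      }
    ]
  }
}
"""


def main_scores_by_subset(result, split="test"):
    """Map each hf_subset in the given split to its main_score."""
    return {entry["hf_subset"]: entry["main_score"]
            for entry in result["scores"][split]}


scores = main_scores_by_subset(json.loads(raw))
print(f'{scores["nld_Latn-nld_Latn"]:.4f}')  # prints "0.8920"
```

Rounded to four decimals, the stored value matches the 0.8920 shown for the newer revision rather than the 0.7760 listed for the older one.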

@Samoed
Member

Samoed commented Nov 13, 2025

Yes, I see. Then the problem may be in the table generation script.

Samoed merged commit 18d477f into embeddings-benchmark:main on Nov 13, 2025
3 checks passed
nikolay-banar deleted the mteb-nl-upd branch on November 18, 2025 09:34
nikolay-banar mentioned this pull request on Dec 5, 2025
Samoed mentioned this pull request on Jan 15, 2026