Add results for v2 datasets for MTEB-NL #317
Conversation
Model Results Comparison

Results for Alibaba-NLP/gte-multilingual-base
| task_name | Alibaba-NLP/gte-multilingual-base | Alibaba-NLP/gte-multilingual-base | Alibaba-NLP/gte-multilingual-base | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|---|---|
| Revisions | 7fc06782350c1a83f88b15dd4b38ef853d3b8503 | ca1791e0bcc104f6db161f27de1340241b13c5a4 | external | | | |
| ArguAna-NL.v2 | nan | 0.5283 | nan | nan | 0.4894 | |
| BelebeleRetrieval | 0.7760 | 0.8920 | nan | 0.9073 | 0.7791 | 0.9167 |
| CovidDisinformationNLMultiLabelClassification | nan | 0.4972 | nan | nan | 0.4970 | |
| DutchBookReviewSentimentClassification | nan | 0.7695 | nan | nan | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | nan | 0.7459 | nan | nan | 0.6256 | |
| DutchColaClassification | nan | 0.5550 | nan | nan | 0.5676 | |
| DutchGovernmentBiasClassification | nan | 0.6193 | nan | nan | 0.6193 | |
| DutchNewsArticlesClassification | nan | 0.5300 | nan | nan | 0.5781 | |
| DutchNewsArticlesClusteringP2P | nan | 0.3507 | nan | nan | 0.4045 | |
| DutchNewsArticlesClusteringS2S | nan | 0.2790 | nan | nan | 0.2601 | |
| DutchNewsArticlesRetrieval | nan | 0.8088 | nan | nan | 0.7459 | |
| DutchSarcasticHeadlinesClassification | nan | 0.6578 | nan | nan | 0.7281 | |
| IconclassClassification | nan | 0.5390 | nan | nan | 0.5134 | |
| IconclassClusteringS2S | nan | 0.2595 | nan | nan | 0.2220 | |
| LegalQANLRetrieval | nan | 0.6490 | nan | nan | 0.7748 | |
| MassiveIntentClassification | 0.5680 | 0.6193 | 0.6071 | 0.8192 | 0.6591 | 0.9194 |
| MassiveScenarioClassification | 0.6807 | 0.6985 | 0.6638 | 0.8730 | 0.7012 | 0.9930 |
| MultiEURLEXMultilabelClassification | 0.0094 | 0.0072 | nan | 0.0528 | 0.0516 | 0.0561 |
| MultiHateClassification | 0.6078 | 0.5824 | nan | 0.7247 | 0.6357 | 0.8374 |
| NFCorpus-NL.v2 | nan | 0.2947 | nan | nan | 0.2982 | |
| OpenTenderClassification | nan | 0.4336 | nan | nan | 0.4193 | |
| OpenTenderClusteringP2P | nan | 0.3364 | nan | nan | 0.2301 | |
| OpenTenderClusteringS2S | nan | 0.2761 | nan | nan | 0.1617 | |
| OpenTenderRetrieval | nan | 0.4219 | nan | nan | 0.3778 | |
| SCIDOCS-NL.v2 | nan | 0.1584 | nan | nan | 0.1309 | |
| SIB200Classification | nan | 0.6725 | nan | nan | 0.7339 | 0.7600 |
| SIB200ClusteringS2S | 0.2565 | 0.2411 | nan | 0.4174 | 0.3945 | 0.5067 |
| SICK-NL-STS | nan | 0.7582 | nan | nan | 0.7692 | |
| SICKNLPairClassification | nan | 0.9294 | nan | nan | 0.9332 | |
| STSBenchmarkMultilingualSTS | 0.8432 | 0.8302 | 0.8443 | nan | 0.8349 | 0.9554 |
| SciFact-NL.v2 | nan | 0.6433 | nan | nan | 0.6840 | |
| VABBClusteringP2P | nan | 0.4095 | nan | nan | 0.3437 | |
| VABBClusteringS2S | nan | 0.3520 | nan | nan | 0.3071 | |
| VABBMultiLabelClassification | nan | 0.5212 | nan | nan | 0.5233 | |
| VABBRetrieval | nan | 0.7367 | nan | nan | 0.7036 | |
| VaccinChatNLClassification | nan | 0.4881 | nan | nan | 0.5063 | |
| WikipediaRerankingMultilingual | 0.8263 | 0.8237 | nan | 0.9224 | 0.8970 | 0.9224 |
| WikipediaRetrievalMultilingual | 0.8369 | 0.8400 | nan | 0.9420 | 0.9082 | 0.9420 |
| XLWICNLPairClassification | nan | 0.6274 | nan | nan | 0.6732 | |
| bBSARDNLRetrieval | nan | 0.2009 | nan | nan | 0.2384 | |
| Average | 0.6005 | 0.5396 | 0.7051 | 0.7074 | 0.5425 | 0.7901 |
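The Average rows above are easiest to interpret if missing (nan) scores are skipped rather than counted, which also means that models evaluated on different task subsets are averaged over different denominators and are not directly comparable. A minimal nan-aware sketch of that averaging (illustrative only, not the repository's actual table-generation script):

```python
import math

# Hedged sketch: average per-task scores while skipping missing (nan)
# entries. Models with many nan cells end up averaged over a smaller
# task subset, which can make their Average look better or worse than
# a model scored on the full task list.

def nan_mean(scores):
    """Mean over the non-nan entries; nan if nothing is present."""
    vals = [s for s in scores if not math.isnan(s)]
    return sum(vals) / len(vals) if vals else math.nan

# Example: two present scores and one missing one.
print(nan_mean([0.5283, math.nan, 0.4972]))
```

Because of this, an Average computed over 5 v2 tasks (as in the smaller tables below) is not on the same scale as one computed over the full 40+ task list.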
Results for BAAI/bge-m3
| task_name | BAAI/bge-m3 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.5213 | 0.4894 | |
| CovidDisinformationNLMultiLabelClassification | 0.5097 | 0.4970 | |
| DutchBookReviewSentimentClassification | 0.7790 | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | 0.7624 | 0.6256 | |
| DutchColaClassification | 0.5618 | 0.5676 | |
| DutchGovernmentBiasClassification | 0.6136 | 0.6193 | |
| DutchNewsArticlesClassification | 0.5564 | 0.5781 | |
| DutchNewsArticlesClusteringP2P | 0.3928 | 0.4045 | |
| DutchNewsArticlesClusteringS2S | 0.2027 | 0.2601 | |
| DutchNewsArticlesRetrieval | 0.8167 | 0.7459 | |
| DutchSarcasticHeadlinesClassification | 0.6534 | 0.7281 | |
| IconclassClassification | 0.5128 | 0.5134 | |
| IconclassClusteringS2S | 0.2332 | 0.2220 | |
| LegalQANLRetrieval | 0.8123 | 0.7748 | |
| NFCorpus-NL.v2 | 0.2904 | 0.2982 | |
| OpenTenderClassification | 0.4223 | 0.4193 | |
| OpenTenderClusteringP2P | 0.2586 | 0.2301 | |
| OpenTenderClusteringS2S | 0.2148 | 0.1617 | |
| OpenTenderRetrieval | 0.3904 | 0.3778 | |
| SCIDOCS-NL.v2 | 0.1484 | 0.1309 | |
| SICK-NL-STS | 0.7634 | 0.7692 | |
| SICKNLPairClassification | 0.9224 | 0.9332 | |
| SciFact-NL.v2 | 0.6287 | 0.6840 | |
| VABBClusteringP2P | 0.3641 | 0.3437 | |
| VABBClusteringS2S | 0.2878 | 0.3071 | |
| VABBMultiLabelClassification | 0.5150 | 0.5233 | |
| VABBRetrieval | 0.7496 | 0.7036 | |
| VaccinChatNLClassification | 0.5242 | 0.5063 | |
| XLWICNLPairClassification | 0.6435 | 0.6732 | |
| bBSARDNLRetrieval | 0.2407 | 0.2384 | |
| Average | 0.5097 | 0.5034 | 0.8821 |
Results for Snowflake/snowflake-arctic-embed-l-v2.0
| task_name | Snowflake/snowflake-arctic-embed-l-v2.0 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.5312 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6186 | 0.6256 | |
| NFCorpus-NL.v2 | 0.3006 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1783 | 0.1309 | |
| SciFact-NL.v2 | 0.6920 | 0.6840 | |
| Average | 0.4642 | 0.4457 |
Results for Snowflake/snowflake-arctic-embed-m-v2.0
| task_name | Snowflake/snowflake-arctic-embed-m-v2.0 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL | 0.4708 | 0.4894 | 0.7396 |
| ArguAna-NL.v2 | 0.4711 | 0.4894 | |
| CovidDisinformationNLMultiLabelClassification | 0.4978 | 0.4970 | |
| DutchBookReviewSentimentClassification | 0.5659 | 0.7770 | 0.8821 |
| DutchBookReviewSentimentClassification.v2 | 0.5708 | 0.6256 | |
| DutchColaClassification | 0.5359 | 0.5676 | |
| DutchGovernmentBiasClassification | 0.6119 | 0.6193 | |
| DutchNewsArticlesClassification | 0.5246 | 0.5781 | |
| DutchNewsArticlesClusteringP2P | 0.3435 | 0.4045 | |
| DutchNewsArticlesClusteringS2S | 0.1955 | 0.2601 | |
| DutchNewsArticlesRetrieval | 0.6905 | 0.7459 | |
| DutchSarcasticHeadlinesClassification | 0.6141 | 0.7281 | |
| IconclassClassification | 0.5157 | 0.5134 | |
| IconclassClusteringS2S | 0.2387 | 0.2220 | |
| LegalQANLRetrieval | 0.7416 | 0.7748 | |
| NFCorpus-NL | 0.2765 | 0.2982 | 0.3700 |
| NFCorpus-NL.v2 | 0.2764 | 0.2982 | |
| OpenTenderClassification | 0.3724 | 0.4193 | |
| OpenTenderClusteringP2P | 0.1802 | 0.2301 | |
| OpenTenderClusteringS2S | 0.1602 | 0.1617 | |
| OpenTenderRetrieval | 0.4476 | 0.3778 | |
| SCIDOCS-NL | 0.1538 | 0.1309 | 0.2477 |
| SCIDOCS-NL.v2 | 0.1538 | 0.1309 | |
| SIB200Classification | 0.6684 | 0.7339 | 0.7600 |
| SICK-NL-STS | 0.6373 | 0.7692 | |
| SICKNLPairClassification | 0.7312 | 0.9332 | |
| SciFact-NL | 0.6789 | 0.6840 | 0.8023 |
| SciFact-NL.v2 | 0.6799 | 0.6840 | |
| VABBClusteringP2P | 0.3785 | 0.3437 | |
| VABBClusteringS2S | 0.2918 | 0.3071 | |
| VABBMultiLabelClassification | 0.5013 | 0.5233 | |
| VABBRetrieval | 0.7394 | 0.7036 | |
| VaccinChatNLClassification | 0.4256 | 0.5063 | |
| WebFAQRetrieval | 0.7132 | 0.8072 | 0.8571 |
| XLWICNLPairClassification | 0.6019 | 0.6732 | |
| bBSARDNLRetrieval | 0.1635 | 0.2384 | |
| Average | 0.4672 | 0.5069 | 0.6655 |
Results for ibm-granite/granite-embedding-107m-multilingual
| task_name | ibm-granite/granite-embedding-107m-multilingual | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4548 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.5534 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2354 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1394 | 0.1309 | |
| SciFact-NL.v2 | 0.5888 | 0.6840 | |
| Average | 0.3944 | 0.4457 |
Results for ibm-granite/granite-embedding-278m-multilingual
| task_name | ibm-granite/granite-embedding-278m-multilingual | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4823 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.5569 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2427 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1430 | 0.1309 | |
| SciFact-NL.v2 | 0.6019 | 0.6840 | |
| Average | 0.4054 | 0.4457 |
Results for intfloat/multilingual-e5-base
| task_name | intfloat/multilingual-e5-base | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4510 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6447 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2711 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1237 | 0.1309 | |
| SciFact-NL.v2 | 0.6676 | 0.6840 | |
| Average | 0.4316 | 0.4457 |
Results for intfloat/multilingual-e5-large
| task_name | intfloat/multilingual-e5-large | Max result |
|---|---|---|
| ArguAna-NL.v2 | 0.4894 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | |
| NFCorpus-NL.v2 | 0.2982 | |
| SCIDOCS-NL.v2 | 0.1309 | |
| SciFact-NL.v2 | 0.6840 | |
| Average | 0.4457 |
Results for intfloat/multilingual-e5-small
| task_name | intfloat/multilingual-e5-large | intfloat/multilingual-e5-small | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3989 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.6021 | |
| NFCorpus-NL.v2 | 0.2982 | 0.2797 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.089 | |
| SciFact-NL.v2 | 0.6840 | 0.6093 | |
| Average | 0.4457 | 0.3958 |
Results for minishlab/potion-multilingual-128M
| task_name | intfloat/multilingual-e5-large | minishlab/potion-multilingual-128M | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3669 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.562 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1641 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0696 | |
| SciFact-NL.v2 | 0.6840 | 0.4141 | |
| Average | 0.4457 | 0.3153 |
Results for sentence-transformers/LaBSE
| task_name | intfloat/multilingual-e5-large | sentence-transformers/LaBSE | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3924 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.6057 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1549 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0629 | |
| SciFact-NL.v2 | 0.6840 | 0.3896 | |
| Average | 0.4457 | 0.3211 |
Results for sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
| task_name | intfloat/multilingual-e5-large | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3291 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5776 | |
| NFCorpus-NL.v2 | 0.2982 | 0.163 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0958 | |
| SciFact-NL.v2 | 0.6840 | 0.3382 | |
| Average | 0.4457 | 0.3007 |
Results for sentence-transformers/paraphrase-multilingual-mpnet-base-v2
| task_name | intfloat/multilingual-e5-large | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3434 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5975 | |
| NFCorpus-NL.v2 | 0.2982 | 0.1855 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.1118 | |
| SciFact-NL.v2 | 0.6840 | 0.4223 | |
| Average | 0.4457 | 0.3321 |
Results for sentence-transformers/static-similarity-mrl-multilingual-v1
| task_name | intfloat/multilingual-e5-large | sentence-transformers/static-similarity-mrl-multilingual-v1 | Max result |
|---|---|---|---|
| ArguAna-NL.v2 | 0.4894 | 0.3569 | |
| DutchBookReviewSentimentClassification.v2 | 0.6256 | 0.5951 | |
| NFCorpus-NL.v2 | 0.2982 | 0.193 | |
| SCIDOCS-NL.v2 | 0.1309 | 0.0773 | |
| SciFact-NL.v2 | 0.6840 | 0.4234 | |
| Average | 0.4457 | 0.3292 |
Interesting difference for
There are some fluctuations in the results, but the biggest one is for BelebeleRetrieval: 0.7760 (7fc06782350c1a83f88b15dd4b38ef853d3b8503) vs 0.8920 (ca1791e0bcc104f6db161f27de1340241b13c5a4). However, when I check the old file directly, the results are the same.
Yes, I see. Maybe the problem is in the table-generation script, then.
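One way to narrow down whether the discrepancy comes from the stored result files or from the table generation would be to diff the per-task scores of the two revisions directly. A minimal sketch (the inline dictionaries are illustrative; in practice they would be parsed from the per-revision result JSON files, whose exact layout is not shown here):

```python
# Hedged sketch: report tasks whose scores differ between two result
# sets, to separate "files actually differ" from "table script bug".

def diff_scores(old: dict, new: dict, tol: float = 1e-4) -> dict:
    """Return {task: (old_score, new_score)} for tasks that are missing
    on one side or whose scores differ by more than tol."""
    diffs = {}
    for task in sorted(set(old) | set(new)):
        a, b = old.get(task), new.get(task)
        if a is None or b is None or abs(a - b) > tol:
            diffs[task] = (a, b)
    return diffs

# Illustrative values taken from the table above.
old_rev = {"BelebeleRetrieval": 0.7760, "MassiveIntentClassification": 0.5680}
new_rev = {"BelebeleRetrieval": 0.8920, "MassiveIntentClassification": 0.5680}

for task, (a, b) in diff_scores(old_rev, new_rev).items():
    print(f"{task}: {a} -> {b}")
```

If the stored files agree but the rendered table does not, the bug is on the table-generation side.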
Checklist
Added the model to `mteb/models/` (this can also be a model running via an API). Instructions on how to add a model can be found here.