Add e5-nl results on MTEB-NL #351
Conversation
Model Results Comparison

Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large

Results for clips/e5-base-trm-nl
| task_name | clips/e5-base-trm-nl | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-NL.v2 | 0.4634 | nan | 0.4894 | 0.5603 |
| BelebeleRetrieval | 0.9380 | 0.9073 | 0.7791 | 0.9167 |
| CovidDisinformationNLMultiLabelClassification | 0.4873 | nan | 0.4970 | 0.5361 |
| DutchBookReviewSentimentClassification.v2 | 0.6735 | nan | 0.6256 | 0.9228 |
| DutchColaClassification | 0.5611 | nan | 0.5676 | 0.5684 |
| DutchGovernmentBiasClassification | 0.6094 | nan | 0.6193 | 0.6193 |
| DutchNewsArticlesClassification | 0.5679 | nan | 0.5781 | 0.6236 |
| DutchNewsArticlesClusteringP2P | 0.3940 | nan | 0.4045 | 0.4742 |
| DutchNewsArticlesClusteringS2S | 0.2833 | nan | 0.2601 | 0.3471 |
| DutchNewsArticlesRetrieval | 0.6652 | nan | 0.7459 | 0.8200 |
| DutchSarcasticHeadlinesClassification | 0.6636 | nan | 0.7281 | 0.7281 |
| IconclassClassification | 0.5399 | nan | 0.5134 | 0.5724 |
| IconclassClusteringS2S | 0.2550 | nan | 0.2220 | 0.3077 |
| LegalQANLRetrieval | 0.6725 | nan | 0.7748 | 0.8267 |
| MassiveIntentClassification | 0.6273 | 0.8192 | 0.6591 | 0.9194 |
| MassiveScenarioClassification | 0.6876 | 0.8730 | 0.7012 | 0.9930 |
| MultiEURLEXMultilabelClassification | 0.0519 | 0.0528 | 0.0516 | 0.0561 |
| MultiHateClassification | 0.5806 | 0.7247 | 0.6357 | 0.8374 |
| NFCorpus-NL.v2 | 0.2808 | nan | 0.2982 | 0.3301 |
| OpenTenderClassification | 0.4420 | nan | 0.4193 | 0.5166 |
| OpenTenderClusteringP2P | 0.3442 | nan | 0.2301 | 0.5051 |
| OpenTenderClusteringS2S | 0.2743 | nan | 0.1617 | 0.4659 |
| OpenTenderRetrieval | 0.3925 | nan | 0.3778 | 0.4871 |
| SCIDOCS-NL.v2 | 0.1429 | nan | 0.1309 | 0.1833 |
| SIB200Classification | 0.7452 | nan | 0.7339 | 0.7968 |
| SIB200ClusteringS2S | 0.4121 | 0.4174 | 0.3945 | 0.5067 |
| SICK-NL-STS | 0.7375 | nan | 0.7692 | 0.8855 |
| SICKNLPairClassification | 0.8999 | nan | 0.9332 | 0.9711 |
| STSBenchmarkMultilingualSTS | 0.8066 | nan | 0.8349 | 0.9554 |
| SciFact-NL.v2 | 0.6683 | nan | 0.6840 | 0.6958 |
| VABBClusteringP2P | 0.4234 | nan | 0.3437 | 0.5769 |
| VABBClusteringS2S | 0.3532 | nan | 0.3071 | 0.4452 |
| VABBMultiLabelClassification | 0.5389 | nan | 0.5233 | 0.5611 |
| VABBRetrieval | 0.7285 | nan | 0.7036 | 0.8100 |
| VaccinChatNLClassification | 0.4959 | nan | 0.5063 | 0.5768 |
| WebFAQRetrieval | 0.7430 | nan | 0.8072 | 0.8571 |
| WikipediaRerankingMultilingual | 0.8738 | 0.9224 | 0.8981 | 0.9308 |
| WikipediaRetrievalMultilingual | 0.8906 | 0.9420 | 0.9111 | 0.9420 |
| XLWICNLPairClassification | 0.6676 | nan | 0.6732 | 0.6956 |
| bBSARDNLRetrieval | 0.1987 | nan | 0.2384 | 0.3128 |
| Average | 0.5445 | 0.7074 | 0.5433 | 0.6409 |
The model has high performance on these tasks: BelebeleRetrieval
Results for clips/e5-large-trm-nl
| task_name | clips/e5-large-trm-nl | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-NL.v2 | 0.4713 | nan | 0.4894 | 0.5603 |
| BelebeleRetrieval | 0.9305 | 0.9073 | 0.7791 | 0.9167 |
| CovidDisinformationNLMultiLabelClassification | 0.5159 | nan | 0.4970 | 0.5361 |
| DutchBookReviewSentimentClassification.v2 | 0.7297 | nan | 0.6256 | 0.9228 |
| DutchColaClassification | 0.5592 | nan | 0.5676 | 0.5684 |
| DutchGovernmentBiasClassification | 0.6102 | nan | 0.6193 | 0.6193 |
| DutchNewsArticlesClassification | 0.5819 | nan | 0.5781 | 0.6236 |
| DutchNewsArticlesClusteringP2P | 0.4073 | nan | 0.4045 | 0.4742 |
| DutchNewsArticlesClusteringS2S | 0.3043 | nan | 0.2601 | 0.3471 |
| DutchNewsArticlesRetrieval | 0.7104 | nan | 0.7459 | 0.8200 |
| DutchSarcasticHeadlinesClassification | 0.7396 | nan | 0.7281 | 0.7281 |
| IconclassClassification | 0.5314 | nan | 0.5134 | 0.5724 |
| IconclassClusteringS2S | 0.2492 | nan | 0.2220 | 0.3077 |
| LegalQANLRetrieval | 0.7156 | nan | 0.7748 | 0.8267 |
| MassiveIntentClassification | 0.6510 | 0.8192 | 0.6591 | 0.9194 |
| MassiveScenarioClassification | 0.7081 | 0.8730 | 0.7012 | 0.9930 |
| MultiEURLEXMultilabelClassification | 0.0650 | 0.0528 | 0.0516 | 0.0561 |
| MultiHateClassification | 0.6520 | 0.7247 | 0.6357 | 0.8374 |
| NFCorpus-NL.v2 | 0.3080 | nan | 0.2982 | 0.3301 |
| OpenTenderClassification | 0.4713 | nan | 0.4193 | 0.5166 |
| OpenTenderClusteringP2P | 0.3681 | nan | 0.2301 | 0.5051 |
| OpenTenderClusteringS2S | 0.2925 | nan | 0.1617 | 0.4659 |
| OpenTenderRetrieval | 0.4250 | nan | 0.3778 | 0.4871 |
| SCIDOCS-NL.v2 | 0.1593 | nan | 0.1309 | 0.1833 |
| SIB200Classification | 0.7517 | nan | 0.7339 | 0.7968 |
| SIB200ClusteringS2S | 0.4362 | 0.4174 | 0.3945 | 0.5067 |
| SICK-NL-STS | 0.7576 | nan | 0.7692 | 0.8855 |
| SICKNLPairClassification | 0.9530 | nan | 0.9332 | 0.9711 |
| STSBenchmarkMultilingualSTS | 0.8280 | nan | 0.8349 | 0.9554 |
| SciFact-NL.v2 | 0.6391 | nan | 0.6840 | 0.6958 |
| VABBClusteringP2P | 0.4564 | nan | 0.3437 | 0.5769 |
| VABBClusteringS2S | 0.3502 | nan | 0.3071 | 0.4452 |
| VABBMultiLabelClassification | 0.5551 | nan | 0.5233 | 0.5611 |
| VABBRetrieval | 0.7622 | nan | 0.7036 | 0.8100 |
| VaccinChatNLClassification | 0.5141 | nan | 0.5063 | 0.5768 |
| WebFAQRetrieval | 0.7425 | nan | 0.8072 | 0.8571 |
| WikipediaRerankingMultilingual | 0.8718 | 0.9224 | 0.8981 | 0.9308 |
| WikipediaRetrievalMultilingual | 0.8883 | 0.9420 | 0.9111 | 0.9420 |
| XLWICNLPairClassification | 0.6754 | nan | 0.6732 | 0.6956 |
| bBSARDNLRetrieval | 0.2268 | nan | 0.2384 | 0.3128 |
| Average | 0.5641 | 0.7074 | 0.5433 | 0.6409 |
The model has high performance on these tasks: BelebeleRetrieval, DutchSarcasticHeadlinesClassification, MultiEURLEXMultilabelClassification
Results for clips/e5-small-trm-nl
| task_name | clips/e5-small-trm-nl | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-NL.v2 | 0.4628 | nan | 0.4894 | 0.5603 |
| BelebeleRetrieval | 0.9244 | 0.9073 | 0.7791 | 0.9167 |
| CovidDisinformationNLMultiLabelClassification | 0.4916 | nan | 0.4970 | 0.5361 |
| DutchBookReviewSentimentClassification.v2 | 0.6255 | nan | 0.6256 | 0.9228 |
| DutchColaClassification | 0.5495 | nan | 0.5676 | 0.5684 |
| DutchGovernmentBiasClassification | 0.6146 | nan | 0.6193 | 0.6193 |
| DutchNewsArticlesClassification | 0.5749 | nan | 0.5781 | 0.6236 |
| DutchNewsArticlesClusteringP2P | 0.4146 | nan | 0.4045 | 0.4742 |
| DutchNewsArticlesClusteringS2S | 0.2647 | nan | 0.2601 | 0.3471 |
| DutchNewsArticlesRetrieval | 0.6664 | nan | 0.7459 | 0.8200 |
| DutchSarcasticHeadlinesClassification | 0.6645 | nan | 0.7281 | 0.7281 |
| IconclassClassification | 0.5182 | nan | 0.5134 | 0.5724 |
| IconclassClusteringS2S | 0.2257 | nan | 0.2220 | 0.3077 |
| LegalQANLRetrieval | 0.7118 | nan | 0.7748 | 0.8267 |
| MassiveIntentClassification | 0.5980 | 0.8192 | 0.6591 | 0.9194 |
| MassiveScenarioClassification | 0.6719 | 0.8730 | 0.7012 | 0.9930 |
| MultiEURLEXMultilabelClassification | 0.0504 | 0.0528 | 0.0516 | 0.0561 |
| MultiHateClassification | 0.5731 | 0.7247 | 0.6357 | 0.8374 |
| NFCorpus-NL.v2 | 0.2918 | nan | 0.2982 | 0.3301 |
| OpenTenderClassification | 0.4347 | nan | 0.4193 | 0.5166 |
| OpenTenderClusteringP2P | 0.3234 | nan | 0.2301 | 0.5051 |
| OpenTenderClusteringS2S | 0.2320 | nan | 0.1617 | 0.4659 |
| OpenTenderRetrieval | 0.4154 | nan | 0.3778 | 0.4871 |
| SCIDOCS-NL.v2 | 0.1345 | nan | 0.1309 | 0.1833 |
| SIB200Classification | 0.7282 | nan | 0.7339 | 0.7968 |
| SIB200ClusteringS2S | 0.3887 | 0.4174 | 0.3945 | 0.5067 |
| SICK-NL-STS | 0.7223 | nan | 0.7692 | 0.8855 |
| SICKNLPairClassification | 0.8805 | nan | 0.9332 | 0.9711 |
| STSBenchmarkMultilingualSTS | 0.7948 | nan | 0.8349 | 0.9554 |
| SciFact-NL.v2 | 0.6457 | nan | 0.6840 | 0.6958 |
| VABBClusteringP2P | 0.4266 | nan | 0.3437 | 0.5769 |
| VABBClusteringS2S | 0.3470 | nan | 0.3071 | 0.4452 |
| VABBMultiLabelClassification | 0.5216 | nan | 0.5233 | 0.5611 |
| VABBRetrieval | 0.7312 | nan | 0.7036 | 0.8100 |
| VaccinChatNLClassification | 0.4518 | nan | 0.5063 | 0.5768 |
| WebFAQRetrieval | 0.7247 | nan | 0.8072 | 0.8571 |
| WikipediaRerankingMultilingual | 0.8709 | 0.9224 | 0.8981 | 0.9308 |
| WikipediaRetrievalMultilingual | 0.8869 | 0.9420 | 0.9111 | 0.9420 |
| XLWICNLPairClassification | 0.6397 | nan | 0.6732 | 0.6956 |
| bBSARDNLRetrieval | 0.1269 | nan | 0.2384 | 0.3128 |
| Average | 0.5330 | 0.7074 | 0.5433 | 0.6409 |
The model has high performance on these tasks: BelebeleRetrieval
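For context, numbers like the BelebeleRetrieval rows above can in principle be reproduced with the mteb Python API. A minimal sketch follows; the model and task names are taken from the tables, the nld language filter is an assumption, and the exact API surface may vary across mteb versions:

```python
import mteb

# Load one of the submitted models (assumes it loads as a
# SentenceTransformer checkpoint from the Hugging Face Hub).
model = mteb.get_model("clips/e5-base-trm-nl")

# Restrict a multilingual task from the tables above to its Dutch subset.
tasks = mteb.get_tasks(tasks=["BelebeleRetrieval"], languages=["nld"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
print(results)
```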
Looks good, maybe with the exception of …

The results for the nld subset are high for all models.

Ahh @Samoed, should we make an issue on this? (sounds like you know where to look)

Currently I don't know, but yes, we should create an issue.
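If it helps when filing that issue, the per-subset scores can be pulled out of the result files mteb writes during evaluation. A rough sketch, where the result path and JSON layout are assumptions based on mteb's default output format:

```python
import json
from pathlib import Path

# Hypothetical result file written by evaluation.run(...); adjust the
# model folder and revision to match your local results directory.
result_file = Path(
    "results/clips__e5-base-trm-nl/no_revision_available/BelebeleRetrieval.json"
)
data = json.loads(result_file.read_text())

# mteb stores one score entry per (split, hf_subset) pair, so the nld
# subsets discussed above can be filtered out directly.
for split, entries in data["scores"].items():
    for entry in entries:
        if "nld" in entry["hf_subset"]:
            print(split, entry["hf_subset"], entry["main_score"])
```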
Checklist
- [ ] I have added the model implementation to mteb/models/model_implementations/ (this can also be an API-based model). Instructions on how to add a model can be found here.
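As a quick sanity check once the implementation is in place, the entry should be discoverable through mteb's model registry. A small sketch, assuming the model is registered under its Hugging Face name:

```python
import mteb

# Should resolve once the model metadata has been added to
# mteb/models/model_implementations/.
meta = mteb.get_model_meta("clips/e5-base-trm-nl")
print(meta.name, meta.revision)
```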