@mehran-sarmadi
Contributor

for embeddings-benchmark/mteb#3157

Checklist

  • The submitted results were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@KennethEnevoldsen added the "waiting for review of implementation" label (This PR is waiting for an implementation review before merging the results.) on Sep 9, 2025
@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Alibaba-NLP/gte-Qwen2-7B-instruct, BAAI/bge-m3, HooshvareLab/bert-base-parsbert-uncased, MCINext/Hakim-small, MCINext/Hakim-unsup, MCINext/Hakim, PartAI/Tooka-SBERT-V2-Large, PartAI/Tooka-SBERT-V2-Small, PartAI/Tooka-SBERT, PartAI/TookaBERT-Base, google/embeddinggemma-300m, intfloat/e5-mistral-7b-instruct, intfloat/multilingual-e5-base, intfloat/multilingual-e5-large, jinaai/jina-embeddings-v3, m3hrdadfi/bert-zwnj-wnli-mean-tokens, m3hrdadfi/roberta-zwnj-wnli-mean-tokens, myrkur/sentence-transformer-parsbert-fa, openai/text-embedding-3-small, sbunlp/fabert, sentence-transformers/LaBSE, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Tasks: ArguAna-Fa.v2, BeytooteClustering, DeepSentiPers.v2, DigikalamagClassification, DigikalamagClustering, FEVER-FaHardNegatives, FarsTail, FarsiParaphraseDetection, Farsick, FiQA2018-Fa.v2, HamshahriClustring, HotpotQA-FaHardNegatives, MIRACLReranking, MIRACLRetrievalHardNegatives, MSMARCO-FaHardNegatives, MassiveIntentClassification, MassiveScenarioClassification, NLPTwitterAnalysisClassification.v2, NLPTwitterAnalysisClustering, NQ-FaHardNegatives, NeuCLIR2023RetrievalHardNegatives, ParsinluEntail, ParsinluQueryParaphPC, PerShopDomainClassification, PerShopIntentClassification, PersianFoodSentimentClassification, PersianTextEmotion.v2, PersianWebDocumentRetrieval, QuoraRetrieval-Fa.v2, SAMSumFa, SCIDOCS-Fa.v2, SIDClassification.v2, SIDClustring, SciFact-Fa.v2, StyleClassification, SynPerChatbotConvSAAnger, SynPerChatbotConvSAClassification, SynPerChatbotConvSAFear, SynPerChatbotConvSAFriendship, SynPerChatbotConvSAHappiness, SynPerChatbotConvSAJealousy, SynPerChatbotConvSALove, SynPerChatbotConvSASadness, SynPerChatbotConvSASatisfaction, SynPerChatbotConvSASurprise, SynPerChatbotConvSAToneChatbotClassification, SynPerChatbotConvSAToneUserClassification, SynPerChatbotRAGFAQPC, SynPerChatbotRAGFAQRetrieval, SynPerChatbotRAGSumSRetrieval, SynPerChatbotSatisfactionLevelClassification, SynPerChatbotSumSRetrieval, SynPerQAPC, SynPerQARetrieval, SynPerSTS, SynPerTextKeywordsPC, SynPerTextToneClassification.v3, TRECCOVID-Fa.v2, Touche2020-Fa.v2, WebFAQRetrieval, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual

Results for Alibaba-NLP/gte-Qwen2-7B-instruct

task_name Alibaba-NLP/gte-Qwen2-7B-instruct intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4259 0.4127
DeepSentiPers.v2 0.5794 0.5769
FEVER-FaHardNegatives 0.6967 0.4615
FiQA2018-Fa.v2 0.3274 0.2946
HotpotQA-FaHardNegatives 0.5159 0.6153
MSMARCO-FaHardNegatives 0.6295 0.6871
NLPTwitterAnalysisClassification.v2 0.7877 0.7659
NQ-FaHardNegatives 0.4559 0.4983
NeuCLIR2023RetrievalHardNegatives 0.5953 0.5059 0.5950
PerShopDomainClassification 0.5851 0.5517
PerShopIntentClassification 0.8809 0.9069
PersianTextEmotion.v2 0.4196 0.6091
QuoraRetrieval-Fa.v2 0.7500 0.7788
SCIDOCS-Fa.v2 0.1622 0.1222
SIDClassification.v2 0.6263 0.6137
SciFact-Fa.v2 0.6472 0.6037
StyleClassification 0.5943 0.6492
SynPerTextToneClassification.v3 0.6505 0.8412
TRECCOVID-Fa.v2 0.7496 0.7177
Touche2020-Fa.v2 0.4224 0.4978
WebFAQRetrieval 0.7127 0.7459 0.7813
Average 0.5816 0.5931 0.6881

The model has high performance on these tasks: NeuCLIR2023RetrievalHardNegatives
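The "high performance" note appears to list the tasks where the new model's score exceeds the "Max result" column. That rule is inferred from the numbers in these tables, not confirmed by the bot's source. A minimal sketch under that assumption, using three rows from the table above (the function name `tasks_above_max` is hypothetical):

```python
def tasks_above_max(rows):
    """Return task names whose model score is strictly above 'Max result'.

    rows: iterable of (task_name, model_score, max_result) tuples.
    max_result is None when the table shows no value in that column.
    Assumption: a strict '>' comparison, inferred from the tables only.
    """
    return [
        task
        for task, score, best in rows
        if best is not None and score > best
    ]


# Rows copied from the Alibaba-NLP/gte-Qwen2-7B-instruct table.
rows = [
    ("ArguAna-Fa.v2", 0.4259, None),
    ("NeuCLIR2023RetrievalHardNegatives", 0.5953, 0.5950),
    ("WebFAQRetrieval", 0.7127, 0.7813),
]

flagged = tasks_above_max(rows)
print(flagged)
```

Tasks without a "Max result" value are skipped rather than flagged, which matches how the lists in this comment read.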


Results for BAAI/bge-m3

task_name BAAI/bge-m3 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.5403 0.4127
DeepSentiPers.v2 0.6678 0.5769
FEVER-FaHardNegatives 0.6421 0.4615
FiQA2018-Fa.v2 0.3023 0.2946
HotpotQA-FaHardNegatives 0.5762 0.6153
MSMARCO-FaHardNegatives 0.6847 0.6871
NLPTwitterAnalysisClassification.v2 0.7726 0.7659
NQ-FaHardNegatives 0.5021 0.4983
PerShopDomainClassification 0.6646 0.5517
PerShopIntentClassification 0.8988 0.9069
PersianTextEmotion.v2 0.5981 0.6091
QuoraRetrieval-Fa.v2 0.8031 0.7788
SCIDOCS-Fa.v2 0.1499 0.1222
SIDClassification.v2 0.5962 0.6137
SciFact-Fa.v2 0.5858 0.6037
StyleClassification 0.5586 0.6492
SynPerTextToneClassification.v3 0.7263 0.8412
TRECCOVID-Fa.v2 0.7338 0.7177
Touche2020-Fa.v2 0.4853 0.4978
WebFAQRetrieval 0.7726 0.7459 0.7813
Average 0.6131 0.5975 0.7813

Results for HooshvareLab/bert-base-parsbert-uncased

task_name HooshvareLab/bert-base-parsbert-uncased google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.1969 nan 0.4127
DeepSentiPers.v2 0.4971 nan 0.5769
FEVER-FaHardNegatives 0.0170 nan 0.4615
FiQA2018-Fa.v2 0.0153 nan 0.2946
HotpotQA-FaHardNegatives 0.0630 nan 0.6153
MIRACLRetrievalHardNegatives 0.0775 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.2492 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7153 nan 0.7659
NQ-FaHardNegatives 0.0456 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.1171 nan 0.5059 0.5950
PerShopDomainClassification 0.6962 nan 0.5517
PerShopIntentClassification 0.9168 nan 0.9069
PersianTextEmotion.v2 0.4763 nan 0.6091
QuoraRetrieval-Fa.v2 0.4848 nan 0.7788
SCIDOCS-Fa.v2 0.0161 nan 0.1222
SIDClassification.v2 0.5571 nan 0.6137
SciFact-Fa.v2 0.0834 nan 0.6037
StyleClassification 0.9586 nan 0.6492
SynPerTextToneClassification.v3 0.9745 nan 0.8412
TRECCOVID-Fa.v2 0.0867 nan 0.7177
Touche2020-Fa.v2 0.0443 nan 0.4978
WebFAQRetrieval 0.1822 nan 0.7459 0.7813
Average 0.3396 0.6163 0.5931 0.6673
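The "Average" row looks like a per-column mean that skips missing (`nan`) entries, which is why a column with a single value reports that value as its average. This is an inference from the numbers shown, not a statement about the bot's actual code. A small sketch assuming that behaviour (the helper `column_mean` is hypothetical):

```python
import math


def column_mean(values):
    """Mean of a column, ignoring nan entries; nan if nothing is present."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


# google/gemini-embedding-001 has a single score in the table above
# (MIRACLRetrievalHardNegatives); all other rows are nan.
gemini = [math.nan, 0.6163, math.nan]

# The three "Max result" values present in the table above.
max_result = [0.6257, 0.5950, 0.7813]

print(round(column_mean(gemini), 4))
print(round(column_mean(max_result), 4))
```

One consequence of this kind of averaging is that columns are averaged over different task subsets, so the "Average" values in one table are not directly comparable across columns.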

Results for MCINext/Hakim-small

task_name MCINext/Hakim-small google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4268 nan 0.4127
DeepSentiPers.v2 0.6527 nan 0.5769
FEVER-FaHardNegatives 0.5176 nan 0.4615
FiQA2018-Fa.v2 0.1912 nan 0.2946
HotpotQA-FaHardNegatives 0.4450 nan 0.6153
MIRACLRetrievalHardNegatives 0.4488 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.6399 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7819 nan 0.7659
NQ-FaHardNegatives 0.3085 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.4937 nan 0.5059 0.5950
PerShopDomainClassification 0.6896 nan 0.5517
PerShopIntentClassification 0.8633 nan 0.9069
PersianTextEmotion.v2 0.7719 nan 0.6091
QuoraRetrieval-Fa.v2 0.7197 nan 0.7788
SCIDOCS-Fa.v2 0.0981 nan 0.1222
SIDClassification.v2 0.6463 nan 0.6137
SciFact-Fa.v2 0.4977 nan 0.6037
StyleClassification 0.7969 nan 0.6492
SynPerTextToneClassification.v3 0.9184 nan 0.8412
TRECCOVID-Fa.v2 0.4367 nan 0.7177
Touche2020-Fa.v2 0.3738 nan 0.4978
WebFAQRetrieval 0.6934 nan 0.7459 0.7813
Average 0.5642 0.6163 0.5931 0.6673

Results for MCINext/Hakim-unsup

task_name MCINext/Hakim-unsup google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4020 nan 0.4127
DeepSentiPers.v2 0.6494 nan 0.5769
FEVER-FaHardNegatives 0.3961 nan 0.4615
FiQA2018-Fa.v2 0.1705 nan 0.2946
HotpotQA-FaHardNegatives 0.4381 nan 0.6153
MIRACLRetrievalHardNegatives 0.5143 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.6082 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7669 nan 0.7659
NQ-FaHardNegatives 0.3514 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.5333 nan 0.5059 0.5950
PerShopDomainClassification 0.7181 nan 0.5517
PerShopIntentClassification 0.8907 nan 0.9069
PersianTextEmotion.v2 0.6460 nan 0.6091
QuoraRetrieval-Fa.v2 0.7592 nan 0.7788
SCIDOCS-Fa.v2 0.1261 nan 0.1222
SIDClassification.v2 0.6022 nan 0.6137
SciFact-Fa.v2 0.4874 nan 0.6037
StyleClassification 0.7484 nan 0.6492
SynPerTextToneClassification.v3 0.8072 nan 0.8412
TRECCOVID-Fa.v2 0.5779 nan 0.7177
WebFAQRetrieval 0.6611 nan 0.7459 0.7813
Average 0.5645 0.6163 0.5976 0.6673

Results for MCINext/Hakim

task_name MCINext/Hakim google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4613 nan 0.4127
DeepSentiPers.v2 0.7227 nan 0.5769
FEVER-FaHardNegatives 0.5014 nan 0.4615
FiQA2018-Fa.v2 0.2446 nan 0.2946
HotpotQA-FaHardNegatives 0.4799 nan 0.6153
MIRACLRetrievalHardNegatives 0.4725 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.6472 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.8001 nan 0.7659
NQ-FaHardNegatives 0.3475 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.4933 nan 0.5059 0.5950
PerShopDomainClassification 0.6669 nan 0.5517
PerShopIntentClassification 0.8792 nan 0.9069
PersianTextEmotion.v2 0.8645 nan 0.6091
QuoraRetrieval-Fa.v2 0.7457 nan 0.7788
SCIDOCS-Fa.v2 0.1050 nan 0.1222
SIDClassification.v2 0.6845 nan 0.6137
SciFact-Fa.v2 0.5379 nan 0.6037
StyleClassification 0.6896 nan 0.6492
SynPerTextToneClassification.v3 0.9388 nan 0.8412
TRECCOVID-Fa.v2 0.5485 nan 0.7177
Touche2020-Fa.v2 0.3975 nan 0.4978
WebFAQRetrieval 0.7388 nan 0.7459 0.7813
Average 0.5894 0.6163 0.5931 0.6673

Results for PartAI/Tooka-SBERT-V2-Large

task_name PartAI/Tooka-SBERT-V2-Large google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4369 nan 0.4127
DeepSentiPers.v2 0.6564 nan 0.5769
FEVER-FaHardNegatives 0.2492 nan 0.4615
FiQA2018-Fa.v2 0.1921 nan 0.2946
HotpotQA-FaHardNegatives 0.3533 nan 0.6153
MIRACLRetrievalHardNegatives 0.4854 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.5982 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7780 nan 0.7659
NQ-FaHardNegatives 0.3190 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.5561 nan 0.5059 0.5950
PerShopDomainClassification 0.7607 nan 0.5517
PerShopIntentClassification 0.8914 nan 0.9069
PersianTextEmotion.v2 0.5685 nan 0.6091
QuoraRetrieval-Fa.v2 0.7743 nan 0.7788
SCIDOCS-Fa.v2 0.1160 nan 0.1222
SIDClassification.v2 0.5535 nan 0.6137
SciFact-Fa.v2 0.4103 nan 0.6037
StyleClassification 0.8471 nan 0.6492
SynPerTextToneClassification.v3 0.9294 nan 0.8412
TRECCOVID-Fa.v2 0.6676 nan 0.7177
Touche2020-Fa.v2 0.4374 nan 0.4978
WebFAQRetrieval 0.6641 nan 0.7459 0.7813
Average 0.5566 0.6163 0.5931 0.6673

Results for PartAI/Tooka-SBERT-V2-Small

task_name PartAI/Tooka-SBERT-V2-Small google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4564 nan 0.4127
DeepSentiPers.v2 0.6087 nan 0.5769
FEVER-FaHardNegatives 0.3968 nan 0.4615
FiQA2018-Fa.v2 0.1687 nan 0.2946
HotpotQA-FaHardNegatives 0.3601 nan 0.6153
MIRACLRetrievalHardNegatives 0.5306 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.5802 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7717 nan 0.7659
NQ-FaHardNegatives 0.3377 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.5524 nan 0.5059 0.5950
PerShopDomainClassification 0.7465 nan 0.5517
PerShopIntentClassification 0.8816 nan 0.9069
PersianTextEmotion.v2 0.5251 nan 0.6091
QuoraRetrieval-Fa.v2 0.7474 nan 0.7788
SCIDOCS-Fa.v2 0.1166 nan 0.1222
SIDClassification.v2 0.5459 nan 0.6137
SciFact-Fa.v2 0.4138 nan 0.6037
StyleClassification 0.8768 nan 0.6492
SynPerTextToneClassification.v3 0.8429 nan 0.8412
TRECCOVID-Fa.v2 0.6560 nan 0.7177
Touche2020-Fa.v2 0.4269 nan 0.4978
WebFAQRetrieval 0.6332 nan 0.7459 0.7813
Average 0.5534 0.6163 0.5931 0.6673

Results for PartAI/Tooka-SBERT

task_name PartAI/Tooka-SBERT google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.3253 nan 0.4127
DeepSentiPers.v2 0.6345 nan 0.5769
FEVER-FaHardNegatives 0.1515 nan 0.4615
FiQA2018-Fa.v2 0.1267 nan 0.2946
HotpotQA-FaHardNegatives 0.2374 nan 0.6153
MIRACLRetrievalHardNegatives 0.2643 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.4732 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7563 nan 0.7659
NQ-FaHardNegatives 0.1804 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.4927 nan 0.5059 0.5950
PerShopDomainClassification 0.7372 nan 0.5517
PerShopIntentClassification 0.8810 nan 0.9069
PersianTextEmotion.v2 0.5682 nan 0.6091
QuoraRetrieval-Fa.v2 0.7588 nan 0.7788
SCIDOCS-Fa.v2 0.0973 nan 0.1222
SIDClassification.v2 0.5325 nan 0.6137
SciFact-Fa.v2 0.3798 nan 0.6037
StyleClassification 0.7591 nan 0.6492
SynPerTextToneClassification.v3 0.7462 nan 0.8412
TRECCOVID-Fa.v2 0.5796 nan 0.7177
Touche2020-Fa.v2 0.3149 nan 0.4978
WebFAQRetrieval 0.5454 nan 0.7459 0.7813
Average 0.4792 0.6163 0.5931 0.6673

Results for PartAI/TookaBERT-Base

task_name PartAI/TookaBERT-Base google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.2671 nan 0.4127
DeepSentiPers.v2 0.5406 nan 0.5769
FEVER-FaHardNegatives 0.0081 nan 0.4615
FiQA2018-Fa.v2 0.0229 nan 0.2946
HotpotQA-FaHardNegatives 0.0494 nan 0.6153
MIRACLRetrievalHardNegatives 0.0521 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives 0.2741 nan 0.6871
NLPTwitterAnalysisClassification.v2 0.7149 nan 0.7659
NQ-FaHardNegatives 0.0343 nan 0.4983
NeuCLIR2023RetrievalHardNegatives 0.2021 nan 0.5059 0.5950
PerShopDomainClassification 0.6516 nan 0.5517
PerShopIntentClassification 0.9021 nan 0.9069
PersianTextEmotion.v2 0.5288 nan 0.6091
QuoraRetrieval-Fa.v2 0.5060 nan 0.7788
SCIDOCS-Fa.v2 0.0330 nan 0.1222
SIDClassification.v2 0.5876 nan 0.6137
SciFact-Fa.v2 0.1684 nan 0.6037
StyleClassification 0.9641 nan 0.6492
SynPerTextToneClassification.v3 0.9851 nan 0.8412
TRECCOVID-Fa.v2 0.1120 nan 0.7177
Touche2020-Fa.v2 0.0330 nan 0.4978
WebFAQRetrieval 0.2016 nan 0.7459 0.7813
Average 0.3563 0.6163 0.5931 0.6673

Results for google/embeddinggemma-300m

task_name google/embeddinggemma-300m intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.5849 0.4127
BeytooteClustering 0.6510 0.6150 0.6252
DeepSentiPers.v2 0.6027 0.5769
DigikalamagClassification 0.8779 0.8705 0.8631
DigikalamagClustering 0.5143 0.3989 0.4748
FEVER-FaHardNegatives 0.6769 0.4615
FarsTail 0.7703 0.7255 0.7478
FarsiParaphraseDetection 0.9623 0.9757 0.9706
Farsick 0.7108 0.7067 0.7095
FiQA2018-Fa.v2 0.2880 0.2946
HamshahriClustring 0.7207 0.6742 0.6983
HotpotQA-FaHardNegatives 0.5481 0.6153
MIRACLReranking 0.5250 0.5936 0.6026
MSMARCO-FaHardNegatives 0.6236 0.6871
NLPTwitterAnalysisClassification.v2 0.7791 0.7659
NLPTwitterAnalysisClustering 0.8112 0.7848 0.8082
NQ-FaHardNegatives 0.4418 0.4983
NeuCLIR2023RetrievalHardNegatives 0.6225 0.5059 0.5950
ParsinluEntail 0.7103 0.6546 0.6655
ParsinluQueryParaphPC 0.8841 0.8783 0.8709
PerShopDomainClassification 0.6273 0.5517
PerShopIntentClassification 0.9061 0.9069
PersianFoodSentimentClassification 0.8254 0.8212 0.8105
PersianTextEmotion.v2 0.5655 0.6091
PersianWebDocumentRetrieval 0.5282 0.4676 0.5067
QuoraRetrieval-Fa.v2 0.7430 0.7788
SAMSumFa 0.9812 0.9242 0.9247
SCIDOCS-Fa.v2 0.1445 0.1222
SIDClassification.v2 0.6327 0.6137
SIDClustring 0.4858 0.3865 0.4102
SciFact-Fa.v2 0.6523 0.6037
StyleClassification 0.6258 0.6492
SynPerChatbotConvSAAnger 0.8237 0.7193 0.8661
SynPerChatbotConvSAClassification 0.6979 0.6077 0.7472
SynPerChatbotConvSAFear 0.8376 0.7419 0.7769
SynPerChatbotConvSAFriendship 0.5420 0.5283 0.6268
SynPerChatbotConvSAHappiness 0.6263 0.5246 0.7398
SynPerChatbotConvSAJealousy 0.7517 0.7034 0.7621
SynPerChatbotConvSALove 0.5286 0.4629 0.6086
SynPerChatbotConvSASadness 0.7598 0.6441 0.8167
SynPerChatbotConvSASatisfaction 0.7879 0.6058 0.8079
SynPerChatbotConvSASurprise 0.6240 0.5388 0.7198
SynPerChatbotConvSAToneChatbotClassification 0.6458 0.5807 0.8198
SynPerChatbotConvSAToneUserClassification 0.5450 0.5260 0.6197
SynPerChatbotRAGFAQPC 0.6659 0.6303 0.6677
SynPerChatbotRAGFAQRetrieval 0.3401 0.2348 0.4405
SynPerChatbotRAGSumSRetrieval 0.6593 0.4981 0.6037
SynPerChatbotSatisfactionLevelClassification 0.3197 0.2523 0.3343
SynPerChatbotSumSRetrieval 0.4457 0.2760 0.3678
SynPerQAPC 0.9355 0.9516 0.9320
SynPerQARetrieval 0.8628 0.8735 0.8681
SynPerSTS 0.8707 0.8798 0.8691
SynPerTextKeywordsPC 0.9650 0.9479 0.9640
SynPerTextToneClassification.v3 0.7377 0.8412
TRECCOVID-Fa.v2 0.7341 0.7177
WebFAQRetrieval 0.7806 0.7459 0.7813
Average 0.6698 0.6279 0.7112

The model has high performance on these tasks: BeytooteClustering, DigikalamagClassification, DigikalamagClustering, FarsTail, Farsick, HamshahriClustring, NLPTwitterAnalysisClustering, NeuCLIR2023RetrievalHardNegatives, ParsinluEntail, ParsinluQueryParaphPC, PersianFoodSentimentClassification, PersianWebDocumentRetrieval, SAMSumFa, SIDClustring, SynPerChatbotConvSAFear, SynPerChatbotRAGSumSRetrieval, SynPerChatbotSumSRetrieval, SynPerQAPC, SynPerSTS, SynPerTextKeywordsPC


Results for intfloat/e5-mistral-7b-instruct

task_name intfloat/e5-mistral-7b-instruct intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.4652 0.4127
DeepSentiPers.v2 0.5474 0.5769
FEVER-FaHardNegatives 0.6563 0.4615
FiQA2018-Fa.v2 0.2293 0.2946
HotpotQA-FaHardNegatives 0.4643 0.6153
MSMARCO-FaHardNegatives 0.6256 0.6871
NLPTwitterAnalysisClassification.v2 0.7582 0.7659
NQ-FaHardNegatives 0.3873 0.4983
PerShopDomainClassification 0.2709 0.5517
PerShopIntentClassification 0.8724 0.9069
PersianTextEmotion.v2 0.3933 0.6091
QuoraRetrieval-Fa.v2 0.7783 0.7788
SCIDOCS-Fa.v2 0.1404 0.1222
SIDClassification.v2 0.5819 0.6137
SciFact-Fa.v2 0.5552 0.6037
StyleClassification 0.5604 0.6492
SynPerTextToneClassification.v3 0.6149 0.8412
TRECCOVID-Fa.v2 0.6498 0.7177
Touche2020-Fa.v2 0.4352 0.4978
WebFAQRetrieval 0.7224 0.7459 0.7813
Average 0.5354 0.5975 0.7813

Results for intfloat/multilingual-e5-base

task_name intfloat/multilingual-e5-base intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 0.3384 0.4127
DeepSentiPers.v2 0.5897 0.5769
FEVER-FaHardNegatives 0.5419 0.4615
FiQA2018-Fa.v2 0.2298 0.2946
HotpotQA-FaHardNegatives 0.5657 0.6153
MSMARCO-FaHardNegatives 0.6667 0.6871
NLPTwitterAnalysisClassification.v2 0.7492 0.7659
NQ-FaHardNegatives 0.4497 0.4983
PerShopDomainClassification 0.5054 0.5517
PerShopIntentClassification 0.9003 0.9069
PersianTextEmotion.v2 0.5281 0.6091
QuoraRetrieval-Fa.v2 0.7468 0.7788
SCIDOCS-Fa.v2 0.1182 0.1222
SIDClassification.v2 0.6073 0.6137
SciFact-Fa.v2 0.5818 0.6037
StyleClassification 0.6172 0.6492
SynPerTextToneClassification.v3 0.8018 0.8412
TRECCOVID-Fa.v2 0.6291 0.7177
Touche2020-Fa.v2 0.4594 0.4978
Average 0.5593 0.5897

Results for intfloat/multilingual-e5-large

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa.v2 nan 0.4127
DeepSentiPers.v2 nan 0.5769
FEVER-FaHardNegatives nan 0.4615
FiQA2018-Fa.v2 nan 0.2946
HotpotQA-FaHardNegatives nan 0.6153
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.6257
MSMARCO-FaHardNegatives nan 0.6871
NLPTwitterAnalysisClassification.v2 nan 0.7659
NQ-FaHardNegatives nan 0.4983
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.5950
PerShopDomainClassification nan 0.5517
PerShopIntentClassification nan 0.9069
PersianTextEmotion.v2 nan 0.6091
QuoraRetrieval-Fa.v2 nan 0.7788
SCIDOCS-Fa.v2 nan 0.1222
SIDClassification.v2 nan 0.6137
SciFact-Fa.v2 nan 0.6037
StyleClassification nan 0.6492
SynPerTextToneClassification.v3 nan 0.8412
TRECCOVID-Fa.v2 nan 0.7177
Touche2020-Fa.v2 nan 0.4978
Average 0.6163 0.5858 0.6103

Results for jinaai/jina-embeddings-v3

task_name intfloat/multilingual-e5-large jinaai/jina-embeddings-v3 Max result
ArguAna-Fa.v2 0.4127 0.3852
DeepSentiPers.v2 0.5769 0.6407
FEVER-FaHardNegatives 0.4615 0.7050
FiQA2018-Fa.v2 0.2946 0.3551
HotpotQA-FaHardNegatives 0.6153 0.5401
MSMARCO-FaHardNegatives 0.6871 0.6740
NLPTwitterAnalysisClassification.v2 0.7659 0.7677
NQ-FaHardNegatives 0.4983 0.5305
NeuCLIR2023RetrievalHardNegatives 0.5059 0.5896 0.5950
PerShopDomainClassification 0.5517 0.5644
PerShopIntentClassification 0.9069 0.8583
PersianTextEmotion.v2 0.6091 0.5217
QuoraRetrieval-Fa.v2 0.7788 0.5715
SCIDOCS-Fa.v2 0.1222 0.1426
SIDClassification.v2 0.6137 0.6165
SciFact-Fa.v2 0.6037 0.6123
StyleClassification 0.6492 0.6133
SynPerTextToneClassification.v3 0.8412 0.7064
TRECCOVID-Fa.v2 0.7177 0.6858
Touche2020-Fa.v2 0.4978 0.4901
Average 0.5855 0.5785 0.5950

Results for m3hrdadfi/bert-zwnj-wnli-mean-tokens

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large m3hrdadfi/bert-zwnj-wnli-mean-tokens Max result
ArguAna-Fa.v2 nan 0.4127 0.2087
DeepSentiPers.v2 nan 0.5769 0.4511
FEVER-FaHardNegatives nan 0.4615 0.0355
FiQA2018-Fa.v2 nan 0.2946 0.0207
HotpotQA-FaHardNegatives nan 0.6153 0.0232
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.0797 0.6257
MSMARCO-FaHardNegatives nan 0.6871 0.2251
NLPTwitterAnalysisClassification.v2 nan 0.7659 0.7129
NQ-FaHardNegatives nan 0.4983 0.0374
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.1744 0.5950
PerShopDomainClassification nan 0.5517 0.6182
PerShopIntentClassification nan 0.9069 0.8858
PersianTextEmotion.v2 nan 0.6091 0.3833
QuoraRetrieval-Fa.v2 nan 0.7788 0.4677
SCIDOCS-Fa.v2 nan 0.1222 0.0181
SIDClassification.v2 nan 0.6137 0.4827
SciFact-Fa.v2 nan 0.6037 0.0742
StyleClassification nan 0.6492 0.8198
SynPerTextToneClassification.v3 nan 0.8412 0.8910
TRECCOVID-Fa.v2 nan 0.7177 0.1643
Touche2020-Fa.v2 nan 0.4978 0.1279
WebFAQRetrieval nan 0.7459 0.1780 0.7813
Average 0.6163 0.5931 0.3218 0.6673

Results for m3hrdadfi/roberta-zwnj-wnli-mean-tokens

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large m3hrdadfi/roberta-zwnj-wnli-mean-tokens Max result
ArguAna-Fa.v2 nan 0.4127 0.2202
DeepSentiPers.v2 nan 0.5769 0.4354
FEVER-FaHardNegatives nan 0.4615 0.0262
FiQA2018-Fa.v2 nan 0.2946 0.0321
HotpotQA-FaHardNegatives nan 0.6153 0.0266
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.0725 0.6257
MSMARCO-FaHardNegatives nan 0.6871 0.3186
NLPTwitterAnalysisClassification.v2 nan 0.7659 0.7092
NQ-FaHardNegatives nan 0.4983 0.0440
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.1857 0.5950
PerShopDomainClassification nan 0.5517 0.5375
PerShopIntentClassification nan 0.9069 0.8766
PersianTextEmotion.v2 nan 0.6091 0.3756
QuoraRetrieval-Fa.v2 nan 0.7788 0.4789
SCIDOCS-Fa.v2 nan 0.1222 0.0343
SIDClassification.v2 nan 0.6137 0.5041
SciFact-Fa.v2 nan 0.6037 0.0784
StyleClassification nan 0.6492 0.8182
SynPerTextToneClassification.v3 nan 0.8412 0.8930
TRECCOVID-Fa.v2 nan 0.7177 0.2010
Touche2020-Fa.v2 nan 0.4978 0.1387
WebFAQRetrieval nan 0.7459 0.1835 0.7813
Average 0.6163 0.5931 0.3268 0.6673

Results for myrkur/sentence-transformer-parsbert-fa

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large myrkur/sentence-transformer-parsbert-fa Max result
ArguAna-Fa.v2 nan 0.4127 0.2103
DeepSentiPers.v2 nan 0.5769 0.4116
FEVER-FaHardNegatives nan 0.4615 0.0265
FiQA2018-Fa.v2 nan 0.2946 0.0100
HotpotQA-FaHardNegatives nan 0.6153 0.0132
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.0537 0.6257
MSMARCO-FaHardNegatives nan 0.6871 0.2412
NLPTwitterAnalysisClassification.v2 nan 0.7659 0.7464
NQ-FaHardNegatives nan 0.4983 0.0171
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.2394 0.5950
PerShopDomainClassification nan 0.5517 0.6930
PerShopIntentClassification nan 0.9069 0.8582
PersianTextEmotion.v2 nan 0.6091 0.3882
QuoraRetrieval-Fa.v2 nan 0.7788 0.4700
SCIDOCS-Fa.v2 nan 0.1222 0.0206
SIDClassification.v2 nan 0.6137 0.5500
SciFact-Fa.v2 nan 0.6037 0.0496
StyleClassification nan 0.6492 0.7458
SynPerTextToneClassification.v3 nan 0.8412 0.7516
TRECCOVID-Fa.v2 nan 0.7177 0.1005
Touche2020-Fa.v2 nan 0.4978 0.0399
WebFAQRetrieval nan 0.7459 0.1252 0.7813
Average 0.6163 0.5931 0.3074 0.6673

Results for openai/text-embedding-3-small

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large openai/text-embedding-3-small Max result
ArguAna-Fa.v2 nan 0.4127 0.3328
BeytooteClustering nan 0.6150 0.6038 0.6252
DeepSentiPers.v2 nan 0.5769 0.5044
DigikalamagClassification nan 0.8705 0.8493 0.8631
DigikalamagClustering nan 0.3989 0.4597 0.4748
FEVER-FaHardNegatives nan 0.4615 0.2722
FarsTail nan 0.7255 0.6885 0.7478
FarsiParaphraseDetection nan 0.9757 0.9483 0.9706
Farsick nan 0.7067 0.6085 0.7095
FiQA2018-Fa.v2 nan 0.2946 0.1011
HamshahriClustring nan 0.6742 0.6633 0.6983
HotpotQA-FaHardNegatives nan 0.6153 0.2813
MIRACLReranking nan 0.5936 0.3477 0.6026
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.2724 0.6257
MSMARCO-FaHardNegatives nan 0.6871 0.4767
MassiveIntentClassification 0.8349 0.6549 0.5217 0.8349
MassiveScenarioClassification 0.8863 0.6859 0.5667 0.8863
NLPTwitterAnalysisClassification.v2 nan 0.7659 0.7203
NLPTwitterAnalysisClustering nan 0.7848 0.7892 0.8082
NQ-FaHardNegatives nan 0.4983 0.1996
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.3452 0.5950
ParsinluEntail nan 0.6546 0.5747 0.6655
ParsinluQueryParaphPC nan 0.8783 0.7701 0.8709
PerShopDomainClassification nan 0.5517 0.4921
PerShopIntentClassification nan 0.9069 0.8870
PersianFoodSentimentClassification nan 0.8212 0.6615 0.8105
PersianTextEmotion.v2 nan 0.6091 0.4459
PersianWebDocumentRetrieval nan 0.4676 0.3508 0.5067
QuoraRetrieval-Fa.v2 nan 0.7788 0.6223
SAMSumFa nan 0.9242 0.8461 0.9247
SCIDOCS-Fa.v2 nan 0.1222 0.0840
SIDClassification.v2 nan 0.6137 0.5272
SIDClustring nan 0.3865 0.3568 0.4102
SciFact-Fa.v2 nan 0.6037 0.4125
StyleClassification nan 0.6492 0.6229
SynPerChatbotConvSAAnger nan 0.7193 0.8539 0.8661
SynPerChatbotConvSAClassification nan 0.6077 0.7151 0.7472
SynPerChatbotConvSAFear nan 0.7419 0.7641 0.7769
SynPerChatbotConvSAFriendship nan 0.5283 0.5891 0.6268
SynPerChatbotConvSAHappiness nan 0.5246 0.6398 0.7398
SynPerChatbotConvSAJealousy nan 0.7034 0.7276 0.7621
SynPerChatbotConvSALove nan 0.4629 0.5200 0.6086
SynPerChatbotConvSASadness nan 0.6441 0.7922 0.8167
SynPerChatbotConvSASatisfaction nan 0.6058 0.8664 0.8079
SynPerChatbotConvSASurprise nan 0.5388 0.6826 0.7198
SynPerChatbotConvSAToneChatbotClassification nan 0.5807 0.6666 0.8198
SynPerChatbotConvSAToneUserClassification nan 0.5260 0.6037 0.6197
SynPerChatbotRAGFAQPC nan 0.6303 0.6896 0.6677
SynPerChatbotRAGFAQRetrieval nan 0.2348 0.2826 0.4405
SynPerChatbotRAGSumSRetrieval nan 0.4981 0.4781 0.6037
SynPerChatbotSatisfactionLevelClassification nan 0.2523 0.3664 0.3343
SynPerChatbotSumSRetrieval nan 0.2760 0.2216 0.3678
SynPerQAPC nan 0.9516 0.9056 0.9320
SynPerQARetrieval nan 0.8735 0.6358 0.8681
SynPerSTS nan 0.8798 0.7733 0.8691
SynPerTextKeywordsPC nan 0.9479 0.9146 0.9640
SynPerTextToneClassification.v3 nan 0.8412 0.7289
TRECCOVID-Fa.v2 nan 0.7177 0.2937
WebFAQRetrieval nan 0.7459 0.5044 0.7813
WikipediaRerankingMultilingual 0.9120 0.8932 0.8094 0.9120
WikipediaRetrievalMultilingual 0.9357 0.9040 0.7532 0.9357
Average 0.8370 0.6376 0.5735 0.7260

The model has high performance on these tasks: SynPerChatbotConvSASatisfaction, SynPerChatbotRAGFAQPC, SynPerChatbotSatisfactionLevelClassification


Results for sbunlp/fabert

task_name google/gemini-embedding-001 intfloat/multilingual-e5-large sbunlp/fabert Max result
ArguAna-Fa.v2 nan 0.4127 0.1913
DeepSentiPers.v2 nan 0.5769 0.4175
FEVER-FaHardNegatives nan 0.4615 0.0463
FiQA2018-Fa.v2 nan 0.2946 0.0367
HotpotQA-FaHardNegatives nan 0.6153 0.0926
MIRACLRetrievalHardNegatives 0.6163 0.5923 0.1285 0.6257
MSMARCO-FaHardNegatives nan 0.6871 0.2718
NLPTwitterAnalysisClassification.v2 nan 0.7659 0.7106
NQ-FaHardNegatives nan 0.4983 0.0783
NeuCLIR2023RetrievalHardNegatives nan 0.5059 0.3032 0.5950
PerShopDomainClassification nan 0.5517 0.5465
PerShopIntentClassification nan 0.9069 0.8962
PersianTextEmotion.v2 nan 0.6091 0.4884
QuoraRetrieval-Fa.v2 nan 0.7788 0.5246
SCIDOCS-Fa.v2 nan 0.1222 0.0411
SIDClassification.v2 nan 0.6137 0.5400
SciFact-Fa.v2 nan 0.6037 0.1848
StyleClassification nan 0.6492 0.9771
SynPerTextToneClassification.v3 nan 0.8412 0.9840
TRECCOVID-Fa.v2 nan 0.7177 0.1810
Touche2020-Fa.v2 nan 0.4978 0.0705
WebFAQRetrieval nan 0.7459 0.2758 0.7813
Average 0.6163 0.5931 0.3630 0.6673

Results for sentence-transformers/LaBSE

task_name intfloat/multilingual-e5-large sentence-transformers/LaBSE Max result
ArguAna-Fa.v2 0.4127 0.3812
DeepSentiPers.v2 0.5769 0.5805
FEVER-FaHardNegatives 0.4615 0.1350
FiQA2018-Fa.v2 0.2946 0.0550
HotpotQA-FaHardNegatives 0.6153 0.1619
MSMARCO-FaHardNegatives 0.6871 0.3307
NLPTwitterAnalysisClassification.v2 0.7659 0.7536
NQ-FaHardNegatives 0.4983 0.1221
PerShopDomainClassification 0.5517 0.5665
PerShopIntentClassification 0.9069 0.9284
PersianTextEmotion.v2 0.6091 0.5333
QuoraRetrieval-Fa.v2 0.7788 0.7026
SCIDOCS-Fa.v2 0.1222 0.0713
SIDClassification.v2 0.6137 0.5672
SciFact-Fa.v2 0.6037 0.3387
StyleClassification 0.6492 0.5664
SynPerTextToneClassification.v3 0.8412 0.6849
TRECCOVID-Fa.v2 0.7177 0.2083
Touche2020-Fa.v2 0.4978 0.1362
WebFAQRetrieval 0.7459 0.3671 0.7813
Average 0.5975 0.4095 0.7813

Results for sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

task_name intfloat/multilingual-e5-large sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 Max result
ArguAna-Fa.v2 0.4127 0.3900
DeepSentiPers.v2 0.5769 0.5602
FEVER-FaHardNegatives 0.4615 0.2021
FiQA2018-Fa.v2 0.2946 0.0956
HotpotQA-FaHardNegatives 0.6153 0.1341
MSMARCO-FaHardNegatives 0.6871 0.4321
NLPTwitterAnalysisClassification.v2 0.7659 0.7547
NQ-FaHardNegatives 0.4983 0.1650
PerShopDomainClassification 0.5517 0.5642
PerShopIntentClassification 0.9069 0.8562
PersianTextEmotion.v2 0.6091 0.4471
QuoraRetrieval-Fa.v2 0.7788 0.7148
SCIDOCS-Fa.v2 0.1222 0.0879
SIDClassification.v2 0.6137 0.5447
SciFact-Fa.v2 0.6037 0.3195
StyleClassification 0.6492 0.5122
SynPerTextToneClassification.v3 0.8412 0.6079
TRECCOVID-Fa.v2 0.7177 0.3495
Touche2020-Fa.v2 0.4978 0.3442
Average 0.5897 0.4254

@mehran-sarmadi
Contributor Author

Hi @KennethEnevoldsen, could you please take a look at this PR and let me know if everything looks good or if I should make changes? Thanks!

@KennethEnevoldsen enabled auto-merge (squash) on September 22, 2025, 13:58
@KennethEnevoldsen merged commit 5af2720 into embeddings-benchmark:main on Sep 22, 2025
3 checks passed
@KennethEnevoldsen
Contributor

thanks for the ping :)
