Famteb v2 Results #273

mehran-sarmadi · 2025-09-08T16:39:02Z

Checklist

The results submitted were obtained using the reference implementation
My model is available, either as a publicly accessible API or publicly on, e.g., Hugging Face
I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

github-actions · 2025-09-21T11:26:12Z

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Alibaba-NLP/gte-Qwen2-7B-instruct, BAAI/bge-m3, HooshvareLab/bert-base-parsbert-uncased, MCINext/Hakim-small, MCINext/Hakim-unsup, MCINext/Hakim, PartAI/Tooka-SBERT-V2-Large, PartAI/Tooka-SBERT-V2-Small, PartAI/Tooka-SBERT, PartAI/TookaBERT-Base, google/embeddinggemma-300m, intfloat/e5-mistral-7b-instruct, intfloat/multilingual-e5-base, intfloat/multilingual-e5-large, jinaai/jina-embeddings-v3, m3hrdadfi/bert-zwnj-wnli-mean-tokens, m3hrdadfi/roberta-zwnj-wnli-mean-tokens, myrkur/sentence-transformer-parsbert-fa, openai/text-embedding-3-small, sbunlp/fabert, sentence-transformers/LaBSE, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Tasks: ArguAna-Fa.v2, BeytooteClustering, DeepSentiPers.v2, DigikalamagClassification, DigikalamagClustering, FEVER-FaHardNegatives, FarsTail, FarsiParaphraseDetection, Farsick, FiQA2018-Fa.v2, HamshahriClustring, HotpotQA-FaHardNegatives, MIRACLReranking, MIRACLRetrievalHardNegatives, MSMARCO-FaHardNegatives, MassiveIntentClassification, MassiveScenarioClassification, NLPTwitterAnalysisClassification.v2, NLPTwitterAnalysisClustering, NQ-FaHardNegatives, NeuCLIR2023RetrievalHardNegatives, ParsinluEntail, ParsinluQueryParaphPC, PerShopDomainClassification, PerShopIntentClassification, PersianFoodSentimentClassification, PersianTextEmotion.v2, PersianWebDocumentRetrieval, QuoraRetrieval-Fa.v2, SAMSumFa, SCIDOCS-Fa.v2, SIDClassification.v2, SIDClustring, SciFact-Fa.v2, StyleClassification, SynPerChatbotConvSAAnger, SynPerChatbotConvSAClassification, SynPerChatbotConvSAFear, SynPerChatbotConvSAFriendship, SynPerChatbotConvSAHappiness, SynPerChatbotConvSAJealousy, SynPerChatbotConvSALove, SynPerChatbotConvSASadness, SynPerChatbotConvSASatisfaction, SynPerChatbotConvSASurprise, SynPerChatbotConvSAToneChatbotClassification, SynPerChatbotConvSAToneUserClassification, SynPerChatbotRAGFAQPC, SynPerChatbotRAGFAQRetrieval, SynPerChatbotRAGSumSRetrieval, SynPerChatbotSatisfactionLevelClassification, SynPerChatbotSumSRetrieval, SynPerQAPC, SynPerQARetrieval, SynPerSTS, SynPerTextKeywordsPC, SynPerTextToneClassification.v3, TRECCOVID-Fa.v2, Touche2020-Fa.v2, WebFAQRetrieval, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual

Results for `Alibaba-NLP/gte-Qwen2-7B-instruct`

task_name	Alibaba-NLP/gte-Qwen2-7B-instruct	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4259	0.4127
DeepSentiPers.v2	0.5794	0.5769
FEVER-FaHardNegatives	0.6967	0.4615
FiQA2018-Fa.v2	0.3274	0.2946
HotpotQA-FaHardNegatives	0.5159	0.6153
MSMARCO-FaHardNegatives	0.6295	0.6871
NLPTwitterAnalysisClassification.v2	0.7877	0.7659
NQ-FaHardNegatives	0.4559	0.4983
NeuCLIR2023RetrievalHardNegatives	0.5953	0.5059	0.5950
PerShopDomainClassification	0.5851	0.5517
PerShopIntentClassification	0.8809	0.9069
PersianTextEmotion.v2	0.4196	0.6091
QuoraRetrieval-Fa.v2	0.7500	0.7788
SCIDOCS-Fa.v2	0.1622	0.1222
SIDClassification.v2	0.6263	0.6137
SciFact-Fa.v2	0.6472	0.6037
StyleClassification	0.5943	0.6492
SynPerTextToneClassification.v3	0.6505	0.8412
TRECCOVID-Fa.v2	0.7496	0.7177
Touche2020-Fa.v2	0.4224	0.4978
WebFAQRetrieval	0.7127	0.7459	0.7813
Average	0.5816	0.5931	0.6881

The model has high performance on these tasks: NeuCLIR2023RetrievalHardNegatives
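The flagged task above is the only one in this table where the new model's score is strictly higher than the value in the Max result column (0.5953 vs. 0.5950). The same pattern holds for the other flagged lists in this comment. The sketch below reproduces that check on three rows copied from the table. It is an assumption inferred from the numbers, not the bot's actual implementation, and the function name `tasks_above_max` is hypothetical.

```python
# Rows copied from the gte-Qwen2-7B-instruct table:
# (task, new model score, Max result or None when that cell is empty)
rows = [
    ("NeuCLIR2023RetrievalHardNegatives", 0.5953, 0.5950),
    ("WebFAQRetrieval", 0.7127, 0.7813),
    ("ArguAna-Fa.v2", 0.4259, None),
]


def tasks_above_max(rows):
    """Return tasks whose new score is strictly above the recorded max.

    Assumed rule (inferred from the tables, not confirmed by the source):
    rows without a Max result are never flagged.
    """
    return [
        task
        for task, score, max_result in rows
        if max_result is not None and score > max_result
    ]


print(tasks_above_max(rows))  # ['NeuCLIR2023RetrievalHardNegatives']
```

Rows with an empty Max result cell are skipped here because no task with an empty cell appears in any of the flagged lists.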

Results for `BAAI/bge-m3`

task_name	BAAI/bge-m3	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.5403	0.4127
DeepSentiPers.v2	0.6678	0.5769
FEVER-FaHardNegatives	0.6421	0.4615
FiQA2018-Fa.v2	0.3023	0.2946
HotpotQA-FaHardNegatives	0.5762	0.6153
MSMARCO-FaHardNegatives	0.6847	0.6871
NLPTwitterAnalysisClassification.v2	0.7726	0.7659
NQ-FaHardNegatives	0.5021	0.4983
PerShopDomainClassification	0.6646	0.5517
PerShopIntentClassification	0.8988	0.9069
PersianTextEmotion.v2	0.5981	0.6091
QuoraRetrieval-Fa.v2	0.8031	0.7788
SCIDOCS-Fa.v2	0.1499	0.1222
SIDClassification.v2	0.5962	0.6137
SciFact-Fa.v2	0.5858	0.6037
StyleClassification	0.5586	0.6492
SynPerTextToneClassification.v3	0.7263	0.8412
TRECCOVID-Fa.v2	0.7338	0.7177
Touche2020-Fa.v2	0.4853	0.4978
WebFAQRetrieval	0.7726	0.7459	0.7813
Average	0.6131	0.5975	0.7813

Results for `HooshvareLab/bert-base-parsbert-uncased`

task_name	HooshvareLab/bert-base-parsbert-uncased	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.1969	nan	0.4127
DeepSentiPers.v2	0.4971	nan	0.5769
FEVER-FaHardNegatives	0.0170	nan	0.4615
FiQA2018-Fa.v2	0.0153	nan	0.2946
HotpotQA-FaHardNegatives	0.0630	nan	0.6153
MIRACLRetrievalHardNegatives	0.0775	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.2492	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7153	nan	0.7659
NQ-FaHardNegatives	0.0456	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.1171	nan	0.5059	0.5950
PerShopDomainClassification	0.6962	nan	0.5517
PerShopIntentClassification	0.9168	nan	0.9069
PersianTextEmotion.v2	0.4763	nan	0.6091
QuoraRetrieval-Fa.v2	0.4848	nan	0.7788
SCIDOCS-Fa.v2	0.0161	nan	0.1222
SIDClassification.v2	0.5571	nan	0.6137
SciFact-Fa.v2	0.0834	nan	0.6037
StyleClassification	0.9586	nan	0.6492
SynPerTextToneClassification.v3	0.9745	nan	0.8412
TRECCOVID-Fa.v2	0.0867	nan	0.7177
Touche2020-Fa.v2	0.0443	nan	0.4978
WebFAQRetrieval	0.1822	nan	0.7459	0.7813
Average	0.3396	0.6163	0.5931	0.6673
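The Average row appears to be a per-column mean over only the tasks that have a value in that column, so the averages of different columns are not computed over the same task set. In the table above, for example, the Max result average of 0.6673 matches the mean of its three non-empty cells, and the google/gemini-embedding-001 average equals its single reported score. The sketch below checks this under that assumption. The helper `column_mean` is hypothetical and empty cells are represented as NaN.

```python
import math

NAN = float("nan")


def column_mean(values):
    """Mean of the non-NaN entries, or NaN if the column is empty."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else NAN


# Non-empty Max result cells in the table above:
# MIRACLRetrievalHardNegatives, NeuCLIR2023RetrievalHardNegatives, WebFAQRetrieval.
# The NaN entries stand in for rows where the cell is empty.
max_result = [NAN, 0.6257, 0.5950, NAN, 0.7813]

# google/gemini-embedding-001 has a score for MIRACLRetrievalHardNegatives only.
gemini = [NAN, 0.6163, NAN, NAN, NAN]

print(round(column_mean(max_result), 4))  # 0.6673
print(round(column_mean(gemini), 4))      # 0.6163
```

This also explains why the intfloat/multilingual-e5-large average differs between tables: it is averaged over whichever tasks appear in each table.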

Results for `MCINext/Hakim-small`

task_name	MCINext/Hakim-small	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4268	nan	0.4127
DeepSentiPers.v2	0.6527	nan	0.5769
FEVER-FaHardNegatives	0.5176	nan	0.4615
FiQA2018-Fa.v2	0.1912	nan	0.2946
HotpotQA-FaHardNegatives	0.4450	nan	0.6153
MIRACLRetrievalHardNegatives	0.4488	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.6399	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7819	nan	0.7659
NQ-FaHardNegatives	0.3085	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.4937	nan	0.5059	0.5950
PerShopDomainClassification	0.6896	nan	0.5517
PerShopIntentClassification	0.8633	nan	0.9069
PersianTextEmotion.v2	0.7719	nan	0.6091
QuoraRetrieval-Fa.v2	0.7197	nan	0.7788
SCIDOCS-Fa.v2	0.0981	nan	0.1222
SIDClassification.v2	0.6463	nan	0.6137
SciFact-Fa.v2	0.4977	nan	0.6037
StyleClassification	0.7969	nan	0.6492
SynPerTextToneClassification.v3	0.9184	nan	0.8412
TRECCOVID-Fa.v2	0.4367	nan	0.7177
Touche2020-Fa.v2	0.3738	nan	0.4978
WebFAQRetrieval	0.6934	nan	0.7459	0.7813
Average	0.5642	0.6163	0.5931	0.6673

Results for `MCINext/Hakim-unsup`

task_name	MCINext/Hakim-unsup	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4020	nan	0.4127
DeepSentiPers.v2	0.6494	nan	0.5769
FEVER-FaHardNegatives	0.3961	nan	0.4615
FiQA2018-Fa.v2	0.1705	nan	0.2946
HotpotQA-FaHardNegatives	0.4381	nan	0.6153
MIRACLRetrievalHardNegatives	0.5143	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.6082	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7669	nan	0.7659
NQ-FaHardNegatives	0.3514	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.5333	nan	0.5059	0.5950
PerShopDomainClassification	0.7181	nan	0.5517
PerShopIntentClassification	0.8907	nan	0.9069
PersianTextEmotion.v2	0.6460	nan	0.6091
QuoraRetrieval-Fa.v2	0.7592	nan	0.7788
SCIDOCS-Fa.v2	0.1261	nan	0.1222
SIDClassification.v2	0.6022	nan	0.6137
SciFact-Fa.v2	0.4874	nan	0.6037
StyleClassification	0.7484	nan	0.6492
SynPerTextToneClassification.v3	0.8072	nan	0.8412
TRECCOVID-Fa.v2	0.5779	nan	0.7177
WebFAQRetrieval	0.6611	nan	0.7459	0.7813
Average	0.5645	0.6163	0.5976	0.6673

Results for `MCINext/Hakim`

task_name	MCINext/Hakim	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4613	nan	0.4127
DeepSentiPers.v2	0.7227	nan	0.5769
FEVER-FaHardNegatives	0.5014	nan	0.4615
FiQA2018-Fa.v2	0.2446	nan	0.2946
HotpotQA-FaHardNegatives	0.4799	nan	0.6153
MIRACLRetrievalHardNegatives	0.4725	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.6472	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.8001	nan	0.7659
NQ-FaHardNegatives	0.3475	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.4933	nan	0.5059	0.5950
PerShopDomainClassification	0.6669	nan	0.5517
PerShopIntentClassification	0.8792	nan	0.9069
PersianTextEmotion.v2	0.8645	nan	0.6091
QuoraRetrieval-Fa.v2	0.7457	nan	0.7788
SCIDOCS-Fa.v2	0.1050	nan	0.1222
SIDClassification.v2	0.6845	nan	0.6137
SciFact-Fa.v2	0.5379	nan	0.6037
StyleClassification	0.6896	nan	0.6492
SynPerTextToneClassification.v3	0.9388	nan	0.8412
TRECCOVID-Fa.v2	0.5485	nan	0.7177
Touche2020-Fa.v2	0.3975	nan	0.4978
WebFAQRetrieval	0.7388	nan	0.7459	0.7813
Average	0.5894	0.6163	0.5931	0.6673

Results for `PartAI/Tooka-SBERT-V2-Large`

task_name	PartAI/Tooka-SBERT-V2-Large	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4369	nan	0.4127
DeepSentiPers.v2	0.6564	nan	0.5769
FEVER-FaHardNegatives	0.2492	nan	0.4615
FiQA2018-Fa.v2	0.1921	nan	0.2946
HotpotQA-FaHardNegatives	0.3533	nan	0.6153
MIRACLRetrievalHardNegatives	0.4854	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.5982	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7780	nan	0.7659
NQ-FaHardNegatives	0.3190	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.5561	nan	0.5059	0.5950
PerShopDomainClassification	0.7607	nan	0.5517
PerShopIntentClassification	0.8914	nan	0.9069
PersianTextEmotion.v2	0.5685	nan	0.6091
QuoraRetrieval-Fa.v2	0.7743	nan	0.7788
SCIDOCS-Fa.v2	0.1160	nan	0.1222
SIDClassification.v2	0.5535	nan	0.6137
SciFact-Fa.v2	0.4103	nan	0.6037
StyleClassification	0.8471	nan	0.6492
SynPerTextToneClassification.v3	0.9294	nan	0.8412
TRECCOVID-Fa.v2	0.6676	nan	0.7177
Touche2020-Fa.v2	0.4374	nan	0.4978
WebFAQRetrieval	0.6641	nan	0.7459	0.7813
Average	0.5566	0.6163	0.5931	0.6673

Results for `PartAI/Tooka-SBERT-V2-Small`

task_name	PartAI/Tooka-SBERT-V2-Small	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4564	nan	0.4127
DeepSentiPers.v2	0.6087	nan	0.5769
FEVER-FaHardNegatives	0.3968	nan	0.4615
FiQA2018-Fa.v2	0.1687	nan	0.2946
HotpotQA-FaHardNegatives	0.3601	nan	0.6153
MIRACLRetrievalHardNegatives	0.5306	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.5802	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7717	nan	0.7659
NQ-FaHardNegatives	0.3377	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.5524	nan	0.5059	0.5950
PerShopDomainClassification	0.7465	nan	0.5517
PerShopIntentClassification	0.8816	nan	0.9069
PersianTextEmotion.v2	0.5251	nan	0.6091
QuoraRetrieval-Fa.v2	0.7474	nan	0.7788
SCIDOCS-Fa.v2	0.1166	nan	0.1222
SIDClassification.v2	0.5459	nan	0.6137
SciFact-Fa.v2	0.4138	nan	0.6037
StyleClassification	0.8768	nan	0.6492
SynPerTextToneClassification.v3	0.8429	nan	0.8412
TRECCOVID-Fa.v2	0.6560	nan	0.7177
Touche2020-Fa.v2	0.4269	nan	0.4978
WebFAQRetrieval	0.6332	nan	0.7459	0.7813
Average	0.5534	0.6163	0.5931	0.6673

Results for `PartAI/Tooka-SBERT`

task_name	PartAI/Tooka-SBERT	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.3253	nan	0.4127
DeepSentiPers.v2	0.6345	nan	0.5769
FEVER-FaHardNegatives	0.1515	nan	0.4615
FiQA2018-Fa.v2	0.1267	nan	0.2946
HotpotQA-FaHardNegatives	0.2374	nan	0.6153
MIRACLRetrievalHardNegatives	0.2643	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.4732	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7563	nan	0.7659
NQ-FaHardNegatives	0.1804	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.4927	nan	0.5059	0.5950
PerShopDomainClassification	0.7372	nan	0.5517
PerShopIntentClassification	0.8810	nan	0.9069
PersianTextEmotion.v2	0.5682	nan	0.6091
QuoraRetrieval-Fa.v2	0.7588	nan	0.7788
SCIDOCS-Fa.v2	0.0973	nan	0.1222
SIDClassification.v2	0.5325	nan	0.6137
SciFact-Fa.v2	0.3798	nan	0.6037
StyleClassification	0.7591	nan	0.6492
SynPerTextToneClassification.v3	0.7462	nan	0.8412
TRECCOVID-Fa.v2	0.5796	nan	0.7177
Touche2020-Fa.v2	0.3149	nan	0.4978
WebFAQRetrieval	0.5454	nan	0.7459	0.7813
Average	0.4792	0.6163	0.5931	0.6673

Results for `PartAI/TookaBERT-Base`

task_name	PartAI/TookaBERT-Base	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.2671	nan	0.4127
DeepSentiPers.v2	0.5406	nan	0.5769
FEVER-FaHardNegatives	0.0081	nan	0.4615
FiQA2018-Fa.v2	0.0229	nan	0.2946
HotpotQA-FaHardNegatives	0.0494	nan	0.6153
MIRACLRetrievalHardNegatives	0.0521	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	0.2741	nan	0.6871
NLPTwitterAnalysisClassification.v2	0.7149	nan	0.7659
NQ-FaHardNegatives	0.0343	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	0.2021	nan	0.5059	0.5950
PerShopDomainClassification	0.6516	nan	0.5517
PerShopIntentClassification	0.9021	nan	0.9069
PersianTextEmotion.v2	0.5288	nan	0.6091
QuoraRetrieval-Fa.v2	0.5060	nan	0.7788
SCIDOCS-Fa.v2	0.0330	nan	0.1222
SIDClassification.v2	0.5876	nan	0.6137
SciFact-Fa.v2	0.1684	nan	0.6037
StyleClassification	0.9641	nan	0.6492
SynPerTextToneClassification.v3	0.9851	nan	0.8412
TRECCOVID-Fa.v2	0.1120	nan	0.7177
Touche2020-Fa.v2	0.0330	nan	0.4978
WebFAQRetrieval	0.2016	nan	0.7459	0.7813
Average	0.3563	0.6163	0.5931	0.6673

Results for `google/embeddinggemma-300m`

task_name	google/embeddinggemma-300m	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.5849	0.4127
BeytooteClustering	0.6510	0.6150	0.6252
DeepSentiPers.v2	0.6027	0.5769
DigikalamagClassification	0.8779	0.8705	0.8631
DigikalamagClustering	0.5143	0.3989	0.4748
FEVER-FaHardNegatives	0.6769	0.4615
FarsTail	0.7703	0.7255	0.7478
FarsiParaphraseDetection	0.9623	0.9757	0.9706
Farsick	0.7108	0.7067	0.7095
FiQA2018-Fa.v2	0.2880	0.2946
HamshahriClustring	0.7207	0.6742	0.6983
HotpotQA-FaHardNegatives	0.5481	0.6153
MIRACLReranking	0.5250	0.5936	0.6026
MSMARCO-FaHardNegatives	0.6236	0.6871
NLPTwitterAnalysisClassification.v2	0.7791	0.7659
NLPTwitterAnalysisClustering	0.8112	0.7848	0.8082
NQ-FaHardNegatives	0.4418	0.4983
NeuCLIR2023RetrievalHardNegatives	0.6225	0.5059	0.5950
ParsinluEntail	0.7103	0.6546	0.6655
ParsinluQueryParaphPC	0.8841	0.8783	0.8709
PerShopDomainClassification	0.6273	0.5517
PerShopIntentClassification	0.9061	0.9069
PersianFoodSentimentClassification	0.8254	0.8212	0.8105
PersianTextEmotion.v2	0.5655	0.6091
PersianWebDocumentRetrieval	0.5282	0.4676	0.5067
QuoraRetrieval-Fa.v2	0.7430	0.7788
SAMSumFa	0.9812	0.9242	0.9247
SCIDOCS-Fa.v2	0.1445	0.1222
SIDClassification.v2	0.6327	0.6137
SIDClustring	0.4858	0.3865	0.4102
SciFact-Fa.v2	0.6523	0.6037
StyleClassification	0.6258	0.6492
SynPerChatbotConvSAAnger	0.8237	0.7193	0.8661
SynPerChatbotConvSAClassification	0.6979	0.6077	0.7472
SynPerChatbotConvSAFear	0.8376	0.7419	0.7769
SynPerChatbotConvSAFriendship	0.5420	0.5283	0.6268
SynPerChatbotConvSAHappiness	0.6263	0.5246	0.7398
SynPerChatbotConvSAJealousy	0.7517	0.7034	0.7621
SynPerChatbotConvSALove	0.5286	0.4629	0.6086
SynPerChatbotConvSASadness	0.7598	0.6441	0.8167
SynPerChatbotConvSASatisfaction	0.7879	0.6058	0.8079
SynPerChatbotConvSASurprise	0.6240	0.5388	0.7198
SynPerChatbotConvSAToneChatbotClassification	0.6458	0.5807	0.8198
SynPerChatbotConvSAToneUserClassification	0.5450	0.5260	0.6197
SynPerChatbotRAGFAQPC	0.6659	0.6303	0.6677
SynPerChatbotRAGFAQRetrieval	0.3401	0.2348	0.4405
SynPerChatbotRAGSumSRetrieval	0.6593	0.4981	0.6037
SynPerChatbotSatisfactionLevelClassification	0.3197	0.2523	0.3343
SynPerChatbotSumSRetrieval	0.4457	0.2760	0.3678
SynPerQAPC	0.9355	0.9516	0.9320
SynPerQARetrieval	0.8628	0.8735	0.8681
SynPerSTS	0.8707	0.8798	0.8691
SynPerTextKeywordsPC	0.9650	0.9479	0.9640
SynPerTextToneClassification.v3	0.7377	0.8412
TRECCOVID-Fa.v2	0.7341	0.7177
WebFAQRetrieval	0.7806	0.7459	0.7813
Average	0.6698	0.6279	0.7112

The model has high performance on these tasks: BeytooteClustering, DigikalamagClassification, DigikalamagClustering, FarsTail, Farsick, HamshahriClustring, NLPTwitterAnalysisClustering, NeuCLIR2023RetrievalHardNegatives, ParsinluEntail, ParsinluQueryParaphPC, PersianFoodSentimentClassification, PersianWebDocumentRetrieval, SAMSumFa, SIDClustring, SynPerChatbotConvSAFear, SynPerChatbotRAGSumSRetrieval, SynPerChatbotSumSRetrieval, SynPerQAPC, SynPerSTS, SynPerTextKeywordsPC

Results for `intfloat/e5-mistral-7b-instruct`

task_name	intfloat/e5-mistral-7b-instruct	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	0.4652	0.4127
DeepSentiPers.v2	0.5474	0.5769
FEVER-FaHardNegatives	0.6563	0.4615
FiQA2018-Fa.v2	0.2293	0.2946
HotpotQA-FaHardNegatives	0.4643	0.6153
MSMARCO-FaHardNegatives	0.6256	0.6871
NLPTwitterAnalysisClassification.v2	0.7582	0.7659
NQ-FaHardNegatives	0.3873	0.4983
PerShopDomainClassification	0.2709	0.5517
PerShopIntentClassification	0.8724	0.9069
PersianTextEmotion.v2	0.3933	0.6091
QuoraRetrieval-Fa.v2	0.7783	0.7788
SCIDOCS-Fa.v2	0.1404	0.1222
SIDClassification.v2	0.5819	0.6137
SciFact-Fa.v2	0.5552	0.6037
StyleClassification	0.5604	0.6492
SynPerTextToneClassification.v3	0.6149	0.8412
TRECCOVID-Fa.v2	0.6498	0.7177
Touche2020-Fa.v2	0.4352	0.4978
WebFAQRetrieval	0.7224	0.7459	0.7813
Average	0.5354	0.5975	0.7813

Results for `intfloat/multilingual-e5-base`

task_name	intfloat/multilingual-e5-base	intfloat/multilingual-e5-large
ArguAna-Fa.v2	0.3384	0.4127
DeepSentiPers.v2	0.5897	0.5769
FEVER-FaHardNegatives	0.5419	0.4615
FiQA2018-Fa.v2	0.2298	0.2946
HotpotQA-FaHardNegatives	0.5657	0.6153
MSMARCO-FaHardNegatives	0.6667	0.6871
NLPTwitterAnalysisClassification.v2	0.7492	0.7659
NQ-FaHardNegatives	0.4497	0.4983
PerShopDomainClassification	0.5054	0.5517
PerShopIntentClassification	0.9003	0.9069
PersianTextEmotion.v2	0.5281	0.6091
QuoraRetrieval-Fa.v2	0.7468	0.7788
SCIDOCS-Fa.v2	0.1182	0.1222
SIDClassification.v2	0.6073	0.6137
SciFact-Fa.v2	0.5818	0.6037
StyleClassification	0.6172	0.6492
SynPerTextToneClassification.v3	0.8018	0.8412
TRECCOVID-Fa.v2	0.6291	0.7177
Touche2020-Fa.v2	0.4594	0.4978
Average	0.5593	0.5897

Results for `intfloat/multilingual-e5-large`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	Max result
ArguAna-Fa.v2	nan	0.4127
DeepSentiPers.v2	nan	0.5769
FEVER-FaHardNegatives	nan	0.4615
FiQA2018-Fa.v2	nan	0.2946
HotpotQA-FaHardNegatives	nan	0.6153
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.6257
MSMARCO-FaHardNegatives	nan	0.6871
NLPTwitterAnalysisClassification.v2	nan	0.7659
NQ-FaHardNegatives	nan	0.4983
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.5950
PerShopDomainClassification	nan	0.5517
PerShopIntentClassification	nan	0.9069
PersianTextEmotion.v2	nan	0.6091
QuoraRetrieval-Fa.v2	nan	0.7788
SCIDOCS-Fa.v2	nan	0.1222
SIDClassification.v2	nan	0.6137
SciFact-Fa.v2	nan	0.6037
StyleClassification	nan	0.6492
SynPerTextToneClassification.v3	nan	0.8412
TRECCOVID-Fa.v2	nan	0.7177
Touche2020-Fa.v2	nan	0.4978
Average	0.6163	0.5858	0.6103

Results for `jinaai/jina-embeddings-v3`

task_name	intfloat/multilingual-e5-large	jinaai/jina-embeddings-v3	Max result
ArguAna-Fa.v2	0.4127	0.3852
DeepSentiPers.v2	0.5769	0.6407
FEVER-FaHardNegatives	0.4615	0.7050
FiQA2018-Fa.v2	0.2946	0.3551
HotpotQA-FaHardNegatives	0.6153	0.5401
MSMARCO-FaHardNegatives	0.6871	0.6740
NLPTwitterAnalysisClassification.v2	0.7659	0.7677
NQ-FaHardNegatives	0.4983	0.5305
NeuCLIR2023RetrievalHardNegatives	0.5059	0.5896	0.5950
PerShopDomainClassification	0.5517	0.5644
PerShopIntentClassification	0.9069	0.8583
PersianTextEmotion.v2	0.6091	0.5217
QuoraRetrieval-Fa.v2	0.7788	0.5715
SCIDOCS-Fa.v2	0.1222	0.1426
SIDClassification.v2	0.6137	0.6165
SciFact-Fa.v2	0.6037	0.6123
StyleClassification	0.6492	0.6133
SynPerTextToneClassification.v3	0.8412	0.7064
TRECCOVID-Fa.v2	0.7177	0.6858
Touche2020-Fa.v2	0.4978	0.4901
Average	0.5855	0.5785	0.5950

Results for `m3hrdadfi/bert-zwnj-wnli-mean-tokens`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	m3hrdadfi/bert-zwnj-wnli-mean-tokens	Max result
ArguAna-Fa.v2	nan	0.4127	0.2087
DeepSentiPers.v2	nan	0.5769	0.4511
FEVER-FaHardNegatives	nan	0.4615	0.0355
FiQA2018-Fa.v2	nan	0.2946	0.0207
HotpotQA-FaHardNegatives	nan	0.6153	0.0232
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.0797	0.6257
MSMARCO-FaHardNegatives	nan	0.6871	0.2251
NLPTwitterAnalysisClassification.v2	nan	0.7659	0.7129
NQ-FaHardNegatives	nan	0.4983	0.0374
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.1744	0.5950
PerShopDomainClassification	nan	0.5517	0.6182
PerShopIntentClassification	nan	0.9069	0.8858
PersianTextEmotion.v2	nan	0.6091	0.3833
QuoraRetrieval-Fa.v2	nan	0.7788	0.4677
SCIDOCS-Fa.v2	nan	0.1222	0.0181
SIDClassification.v2	nan	0.6137	0.4827
SciFact-Fa.v2	nan	0.6037	0.0742
StyleClassification	nan	0.6492	0.8198
SynPerTextToneClassification.v3	nan	0.8412	0.8910
TRECCOVID-Fa.v2	nan	0.7177	0.1643
Touche2020-Fa.v2	nan	0.4978	0.1279
WebFAQRetrieval	nan	0.7459	0.1780	0.7813
Average	0.6163	0.5931	0.3218	0.6673

Results for `m3hrdadfi/roberta-zwnj-wnli-mean-tokens`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	m3hrdadfi/roberta-zwnj-wnli-mean-tokens	Max result
ArguAna-Fa.v2	nan	0.4127	0.2202
DeepSentiPers.v2	nan	0.5769	0.4354
FEVER-FaHardNegatives	nan	0.4615	0.0262
FiQA2018-Fa.v2	nan	0.2946	0.0321
HotpotQA-FaHardNegatives	nan	0.6153	0.0266
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.0725	0.6257
MSMARCO-FaHardNegatives	nan	0.6871	0.3186
NLPTwitterAnalysisClassification.v2	nan	0.7659	0.7092
NQ-FaHardNegatives	nan	0.4983	0.0440
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.1857	0.5950
PerShopDomainClassification	nan	0.5517	0.5375
PerShopIntentClassification	nan	0.9069	0.8766
PersianTextEmotion.v2	nan	0.6091	0.3756
QuoraRetrieval-Fa.v2	nan	0.7788	0.4789
SCIDOCS-Fa.v2	nan	0.1222	0.0343
SIDClassification.v2	nan	0.6137	0.5041
SciFact-Fa.v2	nan	0.6037	0.0784
StyleClassification	nan	0.6492	0.8182
SynPerTextToneClassification.v3	nan	0.8412	0.8930
TRECCOVID-Fa.v2	nan	0.7177	0.2010
Touche2020-Fa.v2	nan	0.4978	0.1387
WebFAQRetrieval	nan	0.7459	0.1835	0.7813
Average	0.6163	0.5931	0.3268	0.6673

Results for `myrkur/sentence-transformer-parsbert-fa`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	myrkur/sentence-transformer-parsbert-fa	Max result
ArguAna-Fa.v2	nan	0.4127	0.2103
DeepSentiPers.v2	nan	0.5769	0.4116
FEVER-FaHardNegatives	nan	0.4615	0.0265
FiQA2018-Fa.v2	nan	0.2946	0.0100
HotpotQA-FaHardNegatives	nan	0.6153	0.0132
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.0537	0.6257
MSMARCO-FaHardNegatives	nan	0.6871	0.2412
NLPTwitterAnalysisClassification.v2	nan	0.7659	0.7464
NQ-FaHardNegatives	nan	0.4983	0.0171
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.2394	0.5950
PerShopDomainClassification	nan	0.5517	0.6930
PerShopIntentClassification	nan	0.9069	0.8582
PersianTextEmotion.v2	nan	0.6091	0.3882
QuoraRetrieval-Fa.v2	nan	0.7788	0.4700
SCIDOCS-Fa.v2	nan	0.1222	0.0206
SIDClassification.v2	nan	0.6137	0.5500
SciFact-Fa.v2	nan	0.6037	0.0496
StyleClassification	nan	0.6492	0.7458
SynPerTextToneClassification.v3	nan	0.8412	0.7516
TRECCOVID-Fa.v2	nan	0.7177	0.1005
Touche2020-Fa.v2	nan	0.4978	0.0399
WebFAQRetrieval	nan	0.7459	0.1252	0.7813
Average	0.6163	0.5931	0.3074	0.6673

Results for `openai/text-embedding-3-small`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	openai/text-embedding-3-small	Max result
ArguAna-Fa.v2	nan	0.4127	0.3328
BeytooteClustering	nan	0.6150	0.6038	0.6252
DeepSentiPers.v2	nan	0.5769	0.5044
DigikalamagClassification	nan	0.8705	0.8493	0.8631
DigikalamagClustering	nan	0.3989	0.4597	0.4748
FEVER-FaHardNegatives	nan	0.4615	0.2722
FarsTail	nan	0.7255	0.6885	0.7478
FarsiParaphraseDetection	nan	0.9757	0.9483	0.9706
Farsick	nan	0.7067	0.6085	0.7095
FiQA2018-Fa.v2	nan	0.2946	0.1011
HamshahriClustring	nan	0.6742	0.6633	0.6983
HotpotQA-FaHardNegatives	nan	0.6153	0.2813
MIRACLReranking	nan	0.5936	0.3477	0.6026
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.2724	0.6257
MSMARCO-FaHardNegatives	nan	0.6871	0.4767
MassiveIntentClassification	0.8349	0.6549	0.5217	0.8349
MassiveScenarioClassification	0.8863	0.6859	0.5667	0.8863
NLPTwitterAnalysisClassification.v2	nan	0.7659	0.7203
NLPTwitterAnalysisClustering	nan	0.7848	0.7892	0.8082
NQ-FaHardNegatives	nan	0.4983	0.1996
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.3452	0.5950
ParsinluEntail	nan	0.6546	0.5747	0.6655
ParsinluQueryParaphPC	nan	0.8783	0.7701	0.8709
PerShopDomainClassification	nan	0.5517	0.4921
PerShopIntentClassification	nan	0.9069	0.8870
PersianFoodSentimentClassification	nan	0.8212	0.6615	0.8105
PersianTextEmotion.v2	nan	0.6091	0.4459
PersianWebDocumentRetrieval	nan	0.4676	0.3508	0.5067
QuoraRetrieval-Fa.v2	nan	0.7788	0.6223
SAMSumFa	nan	0.9242	0.8461	0.9247
SCIDOCS-Fa.v2	nan	0.1222	0.0840
SIDClassification.v2	nan	0.6137	0.5272
SIDClustring	nan	0.3865	0.3568	0.4102
SciFact-Fa.v2	nan	0.6037	0.4125
StyleClassification	nan	0.6492	0.6229
SynPerChatbotConvSAAnger	nan	0.7193	0.8539	0.8661
SynPerChatbotConvSAClassification	nan	0.6077	0.7151	0.7472
SynPerChatbotConvSAFear	nan	0.7419	0.7641	0.7769
SynPerChatbotConvSAFriendship	nan	0.5283	0.5891	0.6268
SynPerChatbotConvSAHappiness	nan	0.5246	0.6398	0.7398
SynPerChatbotConvSAJealousy	nan	0.7034	0.7276	0.7621
SynPerChatbotConvSALove	nan	0.4629	0.5200	0.6086
SynPerChatbotConvSASadness	nan	0.6441	0.7922	0.8167
SynPerChatbotConvSASatisfaction	nan	0.6058	0.8664	0.8079
SynPerChatbotConvSASurprise	nan	0.5388	0.6826	0.7198
SynPerChatbotConvSAToneChatbotClassification	nan	0.5807	0.6666	0.8198
SynPerChatbotConvSAToneUserClassification	nan	0.5260	0.6037	0.6197
SynPerChatbotRAGFAQPC	nan	0.6303	0.6896	0.6677
SynPerChatbotRAGFAQRetrieval	nan	0.2348	0.2826	0.4405
SynPerChatbotRAGSumSRetrieval	nan	0.4981	0.4781	0.6037
SynPerChatbotSatisfactionLevelClassification	nan	0.2523	0.3664	0.3343
SynPerChatbotSumSRetrieval	nan	0.2760	0.2216	0.3678
SynPerQAPC	nan	0.9516	0.9056	0.9320
SynPerQARetrieval	nan	0.8735	0.6358	0.8681
SynPerSTS	nan	0.8798	0.7733	0.8691
SynPerTextKeywordsPC	nan	0.9479	0.9146	0.9640
SynPerTextToneClassification.v3	nan	0.8412	0.7289
TRECCOVID-Fa.v2	nan	0.7177	0.2937
WebFAQRetrieval	nan	0.7459	0.5044	0.7813
WikipediaRerankingMultilingual	0.9120	0.8932	0.8094	0.9120
WikipediaRetrievalMultilingual	0.9357	0.9040	0.7532	0.9357
Average	0.8370	0.6376	0.5735	0.7260

The model has high performance on these tasks: SynPerChatbotConvSASatisfaction, SynPerChatbotRAGFAQPC, SynPerChatbotSatisfactionLevelClassification

Results for `sbunlp/fabert`

task_name	google/gemini-embedding-001	intfloat/multilingual-e5-large	sbunlp/fabert	Max result
ArguAna-Fa.v2	nan	0.4127	0.1913
DeepSentiPers.v2	nan	0.5769	0.4175
FEVER-FaHardNegatives	nan	0.4615	0.0463
FiQA2018-Fa.v2	nan	0.2946	0.0367
HotpotQA-FaHardNegatives	nan	0.6153	0.0926
MIRACLRetrievalHardNegatives	0.6163	0.5923	0.1285	0.6257
MSMARCO-FaHardNegatives	nan	0.6871	0.2718
NLPTwitterAnalysisClassification.v2	nan	0.7659	0.7106
NQ-FaHardNegatives	nan	0.4983	0.0783
NeuCLIR2023RetrievalHardNegatives	nan	0.5059	0.3032	0.5950
PerShopDomainClassification	nan	0.5517	0.5465
PerShopIntentClassification	nan	0.9069	0.8962
PersianTextEmotion.v2	nan	0.6091	0.4884
QuoraRetrieval-Fa.v2	nan	0.7788	0.5246
SCIDOCS-Fa.v2	nan	0.1222	0.0411
SIDClassification.v2	nan	0.6137	0.5400
SciFact-Fa.v2	nan	0.6037	0.1848
StyleClassification	nan	0.6492	0.9771
SynPerTextToneClassification.v3	nan	0.8412	0.9840
TRECCOVID-Fa.v2	nan	0.7177	0.1810
Touche2020-Fa.v2	nan	0.4978	0.0705
WebFAQRetrieval	nan	0.7459	0.2758	0.7813
Average	0.6163	0.5931	0.3630	0.6673

Results for `sentence-transformers/LaBSE`

task_name	intfloat/multilingual-e5-large	sentence-transformers/LaBSE	Max result
ArguAna-Fa.v2	0.4127	0.3812
DeepSentiPers.v2	0.5769	0.5805
FEVER-FaHardNegatives	0.4615	0.1350
FiQA2018-Fa.v2	0.2946	0.0550
HotpotQA-FaHardNegatives	0.6153	0.1619
MSMARCO-FaHardNegatives	0.6871	0.3307
NLPTwitterAnalysisClassification.v2	0.7659	0.7536
NQ-FaHardNegatives	0.4983	0.1221
PerShopDomainClassification	0.5517	0.5665
PerShopIntentClassification	0.9069	0.9284
PersianTextEmotion.v2	0.6091	0.5333
QuoraRetrieval-Fa.v2	0.7788	0.7026
SCIDOCS-Fa.v2	0.1222	0.0713
SIDClassification.v2	0.6137	0.5672
SciFact-Fa.v2	0.6037	0.3387
StyleClassification	0.6492	0.5664
SynPerTextToneClassification.v3	0.8412	0.6849
TRECCOVID-Fa.v2	0.7177	0.2083
Touche2020-Fa.v2	0.4978	0.1362
WebFAQRetrieval	0.7459	0.3671	0.7813
Average	0.5975	0.4095	0.7813

Results for `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`

task_name	intfloat/multilingual-e5-large	sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
ArguAna-Fa.v2	0.4127	0.3900
DeepSentiPers.v2	0.5769	0.5602
FEVER-FaHardNegatives	0.4615	0.2021
FiQA2018-Fa.v2	0.2946	0.0956
HotpotQA-FaHardNegatives	0.6153	0.1341
MSMARCO-FaHardNegatives	0.6871	0.4321
NLPTwitterAnalysisClassification.v2	0.7659	0.7547
NQ-FaHardNegatives	0.4983	0.1650
PerShopDomainClassification	0.5517	0.5642
PerShopIntentClassification	0.9069	0.8562
PersianTextEmotion.v2	0.6091	0.4471
QuoraRetrieval-Fa.v2	0.7788	0.7148
SCIDOCS-Fa.v2	0.1222	0.0879
SIDClassification.v2	0.6137	0.5447
SciFact-Fa.v2	0.6037	0.3195
StyleClassification	0.6492	0.5122
SynPerTextToneClassification.v3	0.8412	0.6079
TRECCOVID-Fa.v2	0.7177	0.3495
Touche2020-Fa.v2	0.4978	0.3442
Average	0.5897	0.4254

mehran-sarmadi · 2025-09-22T09:22:44Z

Hi @KennethEnevoldsen , could you please take a look at this PR and let me know if everything looks good or if I should make changes? Thanks!

KennethEnevoldsen · 2025-09-22T13:58:41Z

thanks for the ping :)

mehran and others added 2 commits September 7, 2025 13:48

add v2 results

f5dadeb

add resutls for text-embedding-3-small and embeddinggemma-300m

e49d1d4

KennethEnevoldsen added the "waiting for review of implementation" label (This PR is waiting for an implementation review before merging the results.) Sep 9, 2025

mehran-sarmadi added 3 commits September 15, 2025 18:42

update results

d9f18d1

update hakim-unsup results

f9346fe

remove an unused dataset

ea4a407

KennethEnevoldsen enabled auto-merge (squash) September 22, 2025 13:58

KennethEnevoldsen disabled auto-merge September 22, 2025 13:58

KennethEnevoldsen merged commit 5af2720 into embeddings-benchmark:main Sep 22, 2025
3 checks passed
