add Results of Hakims and Tooka-SBERTV2s #221

Merged
KennethEnevoldsen merged 2 commits into embeddings-benchmark:main from mehran-sarmadi:hakim-and-tookaV2s
Jul 8, 2025

Conversation

@mehran-sarmadi
Contributor

In this PR, we add results for five models: three from the Hakim family (hakim, hakim-small, and hakim-unsup), and two from the TookaSBERTV2 family (TookaSBERT-V2-Small and TookaSBERT-V2-Large).

Checklist

  • My model has a model sheet, report or similar
  • [] My model has a reference implementation in mteb/models/; this can be as an API. Instructions on how to add a model can be found here
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@mehran-sarmadi
Contributor Author

Hi @Samoed,
Since the issue mentioned here seems to be resolved (#222), would rerunning the compare-results check help, or should I make changes on my side?

@Samoed
Member

Samoed commented Jun 25, 2025

I think you need to update your PR first, because restarting the action would only rerun the old version. But for now I think you need to finish the integration of your model.

@KennethEnevoldsen
Contributor

@Samoed, it seems like the CI fails here. Any idea why?

@Samoed
Member

Samoed commented Jul 4, 2025

I think this action run predates all the modifications; to update it, main should be merged into this branch.

@github-actions

github-actions bot commented Jul 5, 2025

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: MCINext/Hakim-small, MCINext/Hakim-unsup, MCINext/Hakim, PartAI/Tooka-SBERT-V2-Large, PartAI/Tooka-SBERT-V2-Small
Tasks: ArguAna-Fa, BeytooteClustering, CExaPPC, CQADupstackAndroidRetrieval-Fa, CQADupstackEnglishRetrieval-Fa, CQADupstackGamingRetrieval-Fa, CQADupstackGisRetrieval-Fa, CQADupstackMathematicaRetrieval-Fa, CQADupstackPhysicsRetrieval-Fa, CQADupstackProgrammersRetrieval-Fa, CQADupstackRetrieval-Fa, CQADupstackStatsRetrieval-Fa, CQADupstackTexRetrieval-Fa, CQADupstackUnixRetrieval-Fa, CQADupstackWebmastersRetrieval-Fa, CQADupstackWordpressRetrieval-Fa, ClimateFEVER-Fa, DBPedia-Fa, DeepSentiPers, DigikalamagClassification, DigikalamagClustering, FarsTail, FarsiParaphraseDetection, Farsick, FiQA2018-Fa, HamshahriClustring, HotpotQA-Fa, MIRACLReranking, MIRACLRetrieval, MSMARCO-Fa, MassiveIntentClassification, MassiveScenarioClassification, NFCorpus-Fa, NLPTwitterAnalysisClassification, NLPTwitterAnalysisClustering, NQ-Fa, ParsinluEntail, ParsinluQueryParaphPC, PersianFoodSentimentClassification, PersianTextEmotion, PersianWebDocumentRetrieval, Query2Query, QuoraRetrieval-Fa, SAMSumFa, SCIDOCS-Fa, SIDClassification, SIDClustring, SciFact-Fa, SentimentDKSF, SynPerChatbotConvSAAnger, SynPerChatbotConvSAClassification, SynPerChatbotConvSAFear, SynPerChatbotConvSAFriendship, SynPerChatbotConvSAHappiness, SynPerChatbotConvSAJealousy, SynPerChatbotConvSALove, SynPerChatbotConvSASadness, SynPerChatbotConvSASatisfaction, SynPerChatbotConvSASurprise, SynPerChatbotConvSAToneChatbotClassification, SynPerChatbotConvSAToneUserClassification, SynPerChatbotRAGFAQPC, SynPerChatbotRAGFAQRetrieval, SynPerChatbotRAGSumSRetrieval, SynPerChatbotRAGToneChatbotClassification, SynPerChatbotRAGToneUserClassification, SynPerChatbotRAGTopicsRetrieval, SynPerChatbotSatisfactionLevelClassification, SynPerChatbotSumSRetrieval, SynPerChatbotToneChatbotClassification, SynPerChatbotToneUserClassification, SynPerChatbotTopicsRetrieval, SynPerQAPC, SynPerQARetrieval, SynPerSTS, SynPerTextKeywordsPC, SynPerTextToneClassification, TRECCOVID-Fa, Touche2020-Fa, WikipediaRerankingMultilingual, 
WikipediaRetrievalMultilingual

Results for MCINext/Hakim-small

task_name MCINext/Hakim-small google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa 0.40 nan 0.45 0.62
BeytooteClustering 0.63 nan 0.62 0.68
CExaPPC 0.98 nan 0.99 0.99
CQADupstackAndroidRetrieval-Fa 0.15 nan 0.42 0.47
CQADupstackEnglishRetrieval-Fa 0.12 nan 0.30 0.35
CQADupstackGamingRetrieval-Fa 0.18 nan 0.46 0.49
CQADupstackGisRetrieval-Fa 0.11 nan 0.30 0.35
CQADupstackMathematicaRetrieval-Fa 0.10 nan 0.19 0.26
CQADupstackPhysicsRetrieval-Fa 0.17 nan 0.37 0.41
CQADupstackProgrammersRetrieval-Fa 0.15 nan 0.35 0.38
CQADupstackRetrieval-Fa 0.12 nan 0.32 0.36
CQADupstackStatsRetrieval-Fa 0.12 nan 0.28 0.30
CQADupstackTexRetrieval-Fa 0.07 nan 0.20 0.25
CQADupstackUnixRetrieval-Fa 0.12 nan 0.32 0.37
CQADupstackWebmastersRetrieval-Fa 0.13 nan 0.34 0.38
CQADupstackWordpressRetrieval-Fa 0.08 nan 0.26 0.30
ClimateFEVER-Fa 0.16 nan 0.13 0.30
DBPedia-Fa 0.21 nan 0.30 0.37
DeepSentiPers 0.67 nan 0.61 0.73
DigikalamagClassification 0.90 nan 0.87 0.91
DigikalamagClustering 0.67 nan 0.40 0.79
FarsTail 0.71 nan 0.73 0.82
FarsiParaphraseDetection 0.99 nan 0.98 1.00
Farsick 0.68 nan 0.71 0.77
FiQA2018-Fa 0.19 nan 0.30 0.37
HamshahriClustring 0.68 nan 0.67 0.76
HotpotQA-Fa 0.42 nan 0.60 0.61
MIRACLReranking 0.48 nan 0.65 0.66
MIRACLRetrieval 0.41 nan 0.59 0.72
MSMARCO-Fa 0.19 nan 0.31 0.31
MassiveIntentClassification 0.64 0.82 0.60 0.92
MassiveScenarioClassification 0.78 0.87 0.70 0.99
NFCorpus-Fa 0.23 nan 0.29 0.31
NLPTwitterAnalysisClassification 0.78 nan 0.76 0.79
NLPTwitterAnalysisClustering 0.84 nan 0.78 0.86
NQ-Fa 0.24 nan 0.45 0.50
ParsinluEntail 0.66 nan 0.65 0.78
ParsinluQueryParaphPC 0.85 nan 0.88 0.90
PersianFoodSentimentClassification 0.84 nan 0.82 0.87
PersianTextEmotion 0.87 nan 0.62 0.92
PersianWebDocumentRetrieval 0.35 nan 0.47 0.57
Query2Query 0.76 nan 0.67 0.82
QuoraRetrieval-Fa 0.74 nan 0.80 0.82
SAMSumFa 0.98 nan 0.92 0.99
SCIDOCS-Fa 0.10 nan 0.12 0.18
SIDClassification 0.65 nan 0.61 0.68
SIDClustring 0.49 nan 0.39 0.55
SciFact-Fa 0.52 nan 0.60 0.74
SentimentDKSF 0.80 nan 0.71 0.83
SynPerChatbotConvSAAnger 0.96 nan 0.72 0.97
SynPerChatbotConvSAClassification 0.87 nan 0.61 0.90
SynPerChatbotConvSAFear 0.90 nan 0.74 0.96
SynPerChatbotConvSAFriendship 0.71 nan 0.53 0.75
SynPerChatbotConvSAHappiness 0.91 nan 0.52 0.92
SynPerChatbotConvSAJealousy 0.76 nan 0.70 0.84
SynPerChatbotConvSALove 0.86 nan 0.46 0.86
SynPerChatbotConvSASadness 0.93 nan 0.64 0.95
SynPerChatbotConvSASatisfaction 0.97 nan 0.61 0.98
SynPerChatbotConvSASurprise 0.79 nan 0.54 0.86
SynPerChatbotConvSAToneChatbotClassification 0.98 nan 0.58 0.99
SynPerChatbotConvSAToneUserClassification 0.91 nan 0.53 0.97
SynPerChatbotRAGFAQPC 0.84 nan 0.63 0.93
SynPerChatbotRAGFAQRetrieval 0.49 nan 0.23 0.54
SynPerChatbotRAGSumSRetrieval 0.75 nan 0.50 0.80
SynPerChatbotRAGToneChatbotClassification 0.87 nan 0.37 0.90
SynPerChatbotRAGToneUserClassification 0.85 nan 0.51 0.93
SynPerChatbotRAGTopicsRetrieval 0.44 nan 0.19 0.50
SynPerChatbotSatisfactionLevelClassification 0.54 nan 0.25 0.60
SynPerChatbotSumSRetrieval 0.63 nan 0.28 0.70
SynPerChatbotToneChatbotClassification 0.92 nan 0.41 0.95
SynPerChatbotToneUserClassification 0.90 nan 0.47 0.95
SynPerChatbotTopicsRetrieval 0.46 nan 0.12 0.52
SynPerQAPC 0.97 nan 0.95 0.98
SynPerQARetrieval 0.88 nan 0.87 0.90
SynPerSTS 0.86 nan 0.88 0.90
SynPerTextKeywordsPC 0.98 nan 0.95 0.99
SynPerTextToneClassification 0.85 nan 0.70 0.91
TRECCOVID-Fa 0.42 nan 0.72 0.77
Touche2020-Fa 0.16 nan 0.26 0.26
WikipediaRerankingMultilingual 0.87 0.92 0.89 0.92
WikipediaRetrievalMultilingual 0.85 0.94 0.90 0.94
Average 0.59 0.89 0.54 0.70
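
A note on reading the Average row: the google/gemini-embedding-001 column has scores for only four tasks, and its 0.89 average is consistent with a mean that skips the nan entries (0.8875, which rounds to 0.89). The sketch below reproduces that arithmetic. It assumes, without confirmation from the bot's source, that nan entries are simply dropped; `nan_skipping_mean` is a hypothetical helper, and only one of the many nan rows is included for brevity.

```python
import math

# Scores copied from the google/gemini-embedding-001 column of the table above.
# ArguAna-Fa stands in for all the tasks that show nan in that column.
gemini_scores = {
    "ArguAna-Fa": float("nan"),
    "MassiveIntentClassification": 0.82,
    "MassiveScenarioClassification": 0.87,
    "WikipediaRerankingMultilingual": 0.92,
    "WikipediaRetrievalMultilingual": 0.94,
}


def nan_skipping_mean(values):
    """Average only the entries that are not nan (assumed bot behaviour)."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else float("nan")


gemini_mean = nan_skipping_mean(gemini_scores.values())
print(f"{gemini_mean:.4f}")
```

If that assumption holds, the Average values in different columns are computed over different task sets, so they are not directly comparable.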

Results for MCINext/Hakim-unsup

task_name MCINext/Hakim-unsup google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa 0.37 nan 0.45 0.62
BeytooteClustering 0.60 nan 0.62 0.68
CExaPPC 0.96 nan 0.99 0.99
CQADupstackAndroidRetrieval-Fa 0.38 nan 0.42 0.47
CQADupstackEnglishRetrieval-Fa 0.30 nan 0.30 0.35
CQADupstackGamingRetrieval-Fa 0.41 nan 0.46 0.49
CQADupstackGisRetrieval-Fa 0.24 nan 0.30 0.35
CQADupstackMathematicaRetrieval-Fa 0.18 nan 0.19 0.26
CQADupstackPhysicsRetrieval-Fa 0.35 nan 0.37 0.41
CQADupstackProgrammersRetrieval-Fa 0.27 nan 0.35 0.38
CQADupstackRetrieval-Fa 0.27 nan 0.32 0.36
CQADupstackStatsRetrieval-Fa 0.23 nan 0.28 0.30
CQADupstackTexRetrieval-Fa 0.16 nan 0.20 0.25
CQADupstackUnixRetrieval-Fa 0.26 nan 0.32 0.37
CQADupstackWebmastersRetrieval-Fa 0.30 nan 0.34 0.38
CQADupstackWordpressRetrieval-Fa 0.21 nan 0.26 0.30
ClimateFEVER-Fa 0.16 nan 0.13 0.30
DBPedia-Fa 0.26 nan 0.30 0.37
DeepSentiPers 0.66 nan 0.61 0.73
DigikalamagClassification 0.85 nan 0.87 0.91
DigikalamagClustering 0.44 nan 0.40 0.79
FarsTail 0.79 nan 0.73 0.82
FarsiParaphraseDetection 0.95 nan 0.98 1.00
Farsick 0.72 nan 0.71 0.77
FiQA2018-Fa 0.18 nan 0.30 0.37
HamshahriClustring 0.69 nan 0.67 0.76
HotpotQA-Fa 0.39 nan 0.60 0.61
MIRACLReranking 0.53 nan 0.65 0.66
MIRACLRetrieval 0.50 nan 0.59 0.72
MSMARCO-Fa 0.22 nan 0.31 0.31
MassiveIntentClassification 0.67 0.82 0.60 0.92
MassiveScenarioClassification 0.71 0.87 0.70 0.99
NFCorpus-Fa 0.29 nan 0.29 0.31
NLPTwitterAnalysisClassification 0.76 nan 0.76 0.79
NLPTwitterAnalysisClustering 0.81 nan 0.78 0.86
NQ-Fa 0.30 nan 0.45 0.50
ParsinluEntail 0.75 nan 0.65 0.78
ParsinluQueryParaphPC 0.89 nan 0.88 0.90
PersianFoodSentimentClassification 0.77 nan 0.82 0.87
PersianTextEmotion 0.65 nan 0.62 0.92
PersianWebDocumentRetrieval 0.57 nan 0.47 0.57
Query2Query 0.82 nan 0.67 0.82
QuoraRetrieval-Fa 0.78 nan 0.80 0.82
SAMSumFa 0.92 nan 0.92 0.99
SCIDOCS-Fa 0.13 nan 0.12 0.18
SIDClassification 0.60 nan 0.61 0.68
SIDClustring 0.40 nan 0.39 0.55
SciFact-Fa 0.48 nan 0.60 0.74
SentimentDKSF 0.71 nan 0.71 0.83
SynPerChatbotConvSAAnger 0.82 nan 0.72 0.97
SynPerChatbotConvSAClassification 0.69 nan 0.61 0.90
SynPerChatbotConvSAFear 0.80 nan 0.74 0.96
SynPerChatbotConvSAFriendship 0.57 nan 0.53 0.75
SynPerChatbotConvSAHappiness 0.61 nan 0.52 0.92
SynPerChatbotConvSAJealousy 0.72 nan 0.70 0.84
SynPerChatbotConvSALove 0.59 nan 0.46 0.86
SynPerChatbotConvSASadness 0.79 nan 0.64 0.95
SynPerChatbotConvSASatisfaction 0.74 nan 0.61 0.98
SynPerChatbotConvSASurprise 0.62 nan 0.54 0.86
SynPerChatbotConvSAToneChatbotClassification 0.59 nan 0.58 0.99
SynPerChatbotConvSAToneUserClassification 0.56 nan 0.53 0.97
SynPerChatbotRAGFAQPC 0.68 nan 0.63 0.93
SynPerChatbotRAGFAQRetrieval 0.32 nan 0.23 0.54
SynPerChatbotRAGSumSRetrieval 0.57 nan 0.50 0.80
SynPerChatbotRAGToneChatbotClassification 0.34 nan 0.37 0.90
SynPerChatbotRAGToneUserClassification 0.53 nan 0.51 0.93
SynPerChatbotRAGTopicsRetrieval 0.22 nan 0.19 0.50
SynPerChatbotSatisfactionLevelClassification 0.29 nan 0.25 0.60
SynPerChatbotSumSRetrieval 0.35 nan 0.28 0.70
SynPerChatbotToneChatbotClassification 0.42 nan 0.41 0.95
SynPerChatbotToneUserClassification 0.49 nan 0.47 0.95
SynPerChatbotTopicsRetrieval 0.14 nan 0.12 0.52
SynPerQAPC 0.92 nan 0.95 0.98
SynPerQARetrieval 0.81 nan 0.87 0.90
SynPerSTS 0.84 nan 0.88 0.90
SynPerTextKeywordsPC 0.96 nan 0.95 0.99
SynPerTextToneClassification 0.64 nan 0.70 0.91
TRECCOVID-Fa 0.52 nan 0.72 0.77
Touche2020-Fa 0.16 nan 0.26 0.26
WikipediaRerankingMultilingual 0.83 0.92 0.89 0.92
WikipediaRetrievalMultilingual 0.84 0.94 0.90 0.94
Average 0.54 0.89 0.54 0.70

Results for MCINext/Hakim

task_name MCINext/Hakim google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa 0.44 nan 0.45 0.62
BeytooteClustering 0.65 nan 0.62 0.68
CExaPPC 0.99 nan 0.99 0.99
CQADupstackAndroidRetrieval-Fa 0.15 nan 0.42 0.47
CQADupstackEnglishRetrieval-Fa 0.15 nan 0.30 0.35
CQADupstackGamingRetrieval-Fa 0.19 nan 0.46 0.49
CQADupstackGisRetrieval-Fa 0.12 nan 0.30 0.35
CQADupstackMathematicaRetrieval-Fa 0.11 nan 0.19 0.26
CQADupstackPhysicsRetrieval-Fa 0.18 nan 0.37 0.41
CQADupstackProgrammersRetrieval-Fa 0.18 nan 0.35 0.38
CQADupstackRetrieval-Fa 0.14 nan 0.32 0.36
CQADupstackStatsRetrieval-Fa 0.15 nan 0.28 0.30
CQADupstackTexRetrieval-Fa 0.09 nan 0.20 0.25
CQADupstackUnixRetrieval-Fa 0.14 nan 0.32 0.37
CQADupstackWebmastersRetrieval-Fa 0.15 nan 0.34 0.38
CQADupstackWordpressRetrieval-Fa 0.11 nan 0.26 0.30
ClimateFEVER-Fa 0.16 nan 0.13 0.30
DBPedia-Fa 0.21 nan 0.30 0.37
DeepSentiPers 0.73 nan 0.61 0.73
DigikalamagClassification 0.91 nan 0.87 0.91
DigikalamagClustering 0.79 nan 0.40 0.79
FarsTail 0.74 nan 0.73 0.82
FarsiParaphraseDetection 1.00 nan 0.98 1.00
Farsick 0.72 nan 0.71 0.77
FiQA2018-Fa 0.25 nan 0.30 0.37
HamshahriClustring 0.69 nan 0.67 0.76
HotpotQA-Fa 0.45 nan 0.60 0.61
MIRACLReranking 0.50 nan 0.65 0.66
MIRACLRetrieval 0.43 nan 0.59 0.72
MSMARCO-Fa 0.22 nan 0.31 0.31
MassiveIntentClassification 0.69 0.82 0.60 0.92
MassiveScenarioClassification 0.85 0.87 0.70 0.99
NFCorpus-Fa 0.21 nan 0.29 0.31
NLPTwitterAnalysisClassification 0.79 nan 0.76 0.79
NLPTwitterAnalysisClustering 0.86 nan 0.78 0.86
NQ-Fa 0.29 nan 0.45 0.50
ParsinluEntail 0.68 nan 0.65 0.78
ParsinluQueryParaphPC 0.87 nan 0.88 0.90
PersianFoodSentimentClassification 0.87 nan 0.82 0.87
PersianTextEmotion 0.92 nan 0.62 0.92
PersianWebDocumentRetrieval 0.35 nan 0.47 0.57
Query2Query 0.76 nan 0.67 0.82
QuoraRetrieval-Fa 0.64 nan 0.80 0.82
SAMSumFa 0.99 nan 0.92 0.99
SCIDOCS-Fa 0.11 nan 0.12 0.18
SIDClassification 0.68 nan 0.61 0.68
SIDClustring 0.55 nan 0.39 0.55
SciFact-Fa 0.54 nan 0.60 0.74
SentimentDKSF 0.83 nan 0.71 0.83
SynPerChatbotConvSAAnger 0.97 nan 0.72 0.97
SynPerChatbotConvSAClassification 0.90 nan 0.61 0.90
SynPerChatbotConvSAFear 0.96 nan 0.74 0.96
SynPerChatbotConvSAFriendship 0.75 nan 0.53 0.75
SynPerChatbotConvSAHappiness 0.92 nan 0.52 0.92
SynPerChatbotConvSAJealousy 0.82 nan 0.70 0.84
SynPerChatbotConvSALove 0.86 nan 0.46 0.86
SynPerChatbotConvSASadness 0.95 nan 0.64 0.95
SynPerChatbotConvSASatisfaction 0.98 nan 0.61 0.98
SynPerChatbotConvSASurprise 0.86 nan 0.54 0.86
SynPerChatbotConvSAToneChatbotClassification 0.99 nan 0.58 0.99
SynPerChatbotConvSAToneUserClassification 0.97 nan 0.53 0.97
SynPerChatbotRAGFAQPC 0.93 nan 0.63 0.93
SynPerChatbotRAGFAQRetrieval 0.54 nan 0.23 0.54
SynPerChatbotRAGSumSRetrieval 0.80 nan 0.50 0.80
SynPerChatbotRAGToneChatbotClassification 0.90 nan 0.37 0.90
SynPerChatbotRAGToneUserClassification 0.93 nan 0.51 0.93
SynPerChatbotRAGTopicsRetrieval 0.50 nan 0.19 0.50
SynPerChatbotSatisfactionLevelClassification 0.60 nan 0.25 0.60
SynPerChatbotSumSRetrieval 0.70 nan 0.28 0.70
SynPerChatbotToneChatbotClassification 0.95 nan 0.41 0.95
SynPerChatbotToneUserClassification 0.95 nan 0.47 0.95
SynPerChatbotTopicsRetrieval 0.52 nan 0.12 0.52
SynPerQAPC 0.98 nan 0.95 0.98
SynPerQARetrieval 0.90 nan 0.87 0.90
SynPerSTS 0.88 nan 0.88 0.90
SynPerTextKeywordsPC 0.99 nan 0.95 0.99
SynPerTextToneClassification 0.90 nan 0.70 0.91
TRECCOVID-Fa 0.53 nan 0.72 0.77
Touche2020-Fa 0.17 nan 0.26 0.26
WikipediaRerankingMultilingual 0.89 0.92 0.89 0.92
WikipediaRetrievalMultilingual 0.88 0.94 0.90 0.94
Average 0.62 0.89 0.54 0.70

Results for PartAI/Tooka-SBERT-V2-Large

task_name PartAI/Tooka-SBERT-V2-Large google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa 0.42 nan 0.45 0.62
BeytooteClustering 0.57 nan 0.62 0.68
CExaPPC 0.99 nan 0.99 0.99
CQADupstackAndroidRetrieval-Fa 0.37 nan 0.42 0.47
CQADupstackEnglishRetrieval-Fa 0.25 nan 0.30 0.35
CQADupstackGamingRetrieval-Fa 0.38 nan 0.46 0.49
CQADupstackGisRetrieval-Fa 0.22 nan 0.30 0.35
CQADupstackMathematicaRetrieval-Fa 0.14 nan 0.19 0.26
CQADupstackPhysicsRetrieval-Fa 0.33 nan 0.37 0.41
CQADupstackProgrammersRetrieval-Fa 0.28 nan 0.35 0.38
CQADupstackRetrieval-Fa 0.26 nan 0.32 0.36
CQADupstackStatsRetrieval-Fa 0.21 nan 0.28 0.30
CQADupstackTexRetrieval-Fa 0.15 nan 0.20 0.25
CQADupstackUnixRetrieval-Fa 0.24 nan 0.32 0.37
CQADupstackWebmastersRetrieval-Fa 0.3 nan 0.34 0.38
CQADupstackWordpressRetrieval-Fa 0.2 nan 0.26 0.30
ClimateFEVER-Fa 0.12 nan 0.13 0.30
DBPedia-Fa 0.2 nan 0.30 0.37
DeepSentiPers 0.66 nan 0.61 0.73
DigikalamagClassification 0.78 nan 0.87 0.91
DigikalamagClustering 0.49 nan 0.40 0.79
FarsTail 0.8 nan 0.73 0.82
FarsiParaphraseDetection 0.95 nan 0.98 1.00
Farsick 0.66 nan 0.71 0.77
FiQA2018-Fa 0.19 nan 0.30 0.37
HamshahriClustring 0.66 nan 0.67 0.76
HotpotQA-Fa 0.28 nan 0.60 0.61
MIRACLReranking 0.53 nan 0.65 0.66
MIRACLRetrieval 0.45 nan 0.59 0.72
MSMARCO-Fa 0.17 nan 0.31 0.31
MassiveIntentClassification 0.68 0.82 0.60 0.92
MassiveScenarioClassification 0.72 0.87 0.70 0.99
NFCorpus-Fa 0.25 nan 0.29 0.31
NLPTwitterAnalysisClassification 0.77 nan 0.76 0.79
NLPTwitterAnalysisClustering 0.81 nan 0.78 0.86
NQ-Fa 0.26 nan 0.45 0.50
ParsinluEntail 0.78 nan 0.65 0.78
ParsinluQueryParaphPC 0.88 nan 0.88 0.90
PersianFoodSentimentClassification 0.79 nan 0.82 0.87
PersianTextEmotion 0.57 nan 0.62 0.92
PersianWebDocumentRetrieval 0.38 nan 0.47 0.57
Query2Query 0.71 nan 0.67 0.82
QuoraRetrieval-Fa 0.79 nan 0.80 0.82
SAMSumFa 0.87 nan 0.92 0.99
SCIDOCS-Fa 0.11 nan 0.12 0.18
SIDClassification 0.55 nan 0.61 0.68
SIDClustring 0.44 nan 0.39 0.55
SciFact-Fa 0.43 nan 0.60 0.74
SentimentDKSF 0.72 nan 0.71 0.83
SynPerChatbotConvSAAnger 0.86 nan 0.72 0.97
SynPerChatbotConvSAClassification 0.73 nan 0.61 0.90
SynPerChatbotConvSAFear 0.83 nan 0.74 0.96
SynPerChatbotConvSAFriendship 0.53 nan 0.53 0.75
SynPerChatbotConvSAHappiness 0.65 nan 0.52 0.92
SynPerChatbotConvSAJealousy 0.73 nan 0.70 0.84
SynPerChatbotConvSALove 0.67 nan 0.46 0.86
SynPerChatbotConvSASadness 0.82 nan 0.64 0.95
SynPerChatbotConvSASatisfaction 0.86 nan 0.61 0.98
SynPerChatbotConvSASurprise 0.64 nan 0.54 0.86
SynPerChatbotConvSAToneChatbotClassification 0.65 nan 0.58 0.99
SynPerChatbotConvSAToneUserClassification 0.59 nan 0.53 0.97
SynPerChatbotRAGFAQPC 0.67 nan 0.63 0.93
SynPerChatbotRAGFAQRetrieval 0.27 nan 0.23 0.54
SynPerChatbotRAGSumSRetrieval 0.48 nan 0.50 0.80
SynPerChatbotRAGToneChatbotClassification 0.38 nan 0.37 0.90
SynPerChatbotRAGToneUserClassification 0.53 nan 0.51 0.93
SynPerChatbotRAGTopicsRetrieval 0.2 nan 0.19 0.50
SynPerChatbotSatisfactionLevelClassification 0.35 nan 0.25 0.60
SynPerChatbotSumSRetrieval 0.24 nan 0.28 0.70
SynPerChatbotToneChatbotClassification 0.48 nan 0.41 0.95
SynPerChatbotToneUserClassification 0.49 nan 0.47 0.95
SynPerChatbotTopicsRetrieval 0.13 nan 0.12 0.52
SynPerQAPC 0.93 nan 0.95 0.98
SynPerQARetrieval 0.82 nan 0.87 0.90
SynPerSTS 0.88 nan 0.88 0.90
SynPerTextKeywordsPC 0.95 nan 0.95 0.99
SynPerTextToneClassification 0.72 nan 0.70 0.91
TRECCOVID-Fa 0.51 nan 0.72 0.77
Touche2020-Fa 0.18 nan 0.26 0.26
WikipediaRerankingMultilingual 0.86 0.92 0.89 0.92
WikipediaRetrievalMultilingual 0.87 0.94 0.90 0.94
Average 0.53 0.89 0.54 0.70

Results for PartAI/Tooka-SBERT-V2-Small

task_name PartAI/Tooka-SBERT-V2-Small google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
ArguAna-Fa 0.41 nan 0.45 0.62
BeytooteClustering 0.61 nan 0.62 0.68
CExaPPC 0.95 nan 0.99 0.99
CQADupstackAndroidRetrieval-Fa 0.33 nan 0.42 0.47
CQADupstackEnglishRetrieval-Fa 0.22 nan 0.30 0.35
CQADupstackGamingRetrieval-Fa 0.35 nan 0.46 0.49
CQADupstackGisRetrieval-Fa 0.21 nan 0.30 0.35
CQADupstackMathematicaRetrieval-Fa 0.15 nan 0.19 0.26
CQADupstackPhysicsRetrieval-Fa 0.31 nan 0.37 0.41
CQADupstackProgrammersRetrieval-Fa 0.26 nan 0.35 0.38
CQADupstackRetrieval-Fa 0.23 nan 0.32 0.36
CQADupstackStatsRetrieval-Fa 0.2 nan 0.28 0.30
CQADupstackTexRetrieval-Fa 0.13 nan 0.20 0.25
CQADupstackUnixRetrieval-Fa 0.22 nan 0.32 0.37
CQADupstackWebmastersRetrieval-Fa 0.27 nan 0.34 0.38
CQADupstackWordpressRetrieval-Fa 0.17 nan 0.26 0.30
ClimateFEVER-Fa 0.15 nan 0.13 0.30
DBPedia-Fa 0.23 nan 0.30 0.37
DeepSentiPers 0.62 nan 0.61 0.73
DigikalamagClassification 0.77 nan 0.87 0.91
DigikalamagClustering 0.48 nan 0.40 0.79
FarsTail 0.76 nan 0.73 0.82
FarsiParaphraseDetection 0.95 nan 0.98 1.00
Farsick 0.64 nan 0.71 0.77
FiQA2018-Fa 0.16 nan 0.30 0.37
HamshahriClustring 0.65 nan 0.67 0.76
HotpotQA-Fa 0.29 nan 0.60 0.61
MIRACLReranking 0.55 nan 0.65 0.66
MIRACLRetrieval 0.51 nan 0.59 0.72
MSMARCO-Fa 0.17 nan 0.31 0.31
MassiveIntentClassification 0.66 0.82 0.60 0.92
MassiveScenarioClassification 0.69 0.87 0.70 0.99
NFCorpus-Fa 0.25 nan 0.29 0.31
NLPTwitterAnalysisClassification 0.77 nan 0.76 0.79
NLPTwitterAnalysisClustering 0.79 nan 0.78 0.86
NQ-Fa 0.25 nan 0.45 0.50
ParsinluEntail 0.73 nan 0.65 0.78
ParsinluQueryParaphPC 0.87 nan 0.88 0.90
PersianFoodSentimentClassification 0.77 nan 0.82 0.87
PersianTextEmotion 0.53 nan 0.62 0.92
PersianWebDocumentRetrieval 0.45 nan 0.47 0.57
Query2Query 0.7 nan 0.67 0.82
QuoraRetrieval-Fa 0.76 nan 0.80 0.82
SAMSumFa 0.74 nan 0.92 0.99
SCIDOCS-Fa 0.11 nan 0.12 0.18
SIDClassification 0.55 nan 0.61 0.68
SIDClustring 0.42 nan 0.39 0.55
SciFact-Fa 0.43 nan 0.60 0.74
SentimentDKSF 0.67 nan 0.71 0.83
SynPerChatbotConvSAAnger 0.81 nan 0.72 0.97
SynPerChatbotConvSAClassification 0.66 nan 0.61 0.90
SynPerChatbotConvSAFear 0.81 nan 0.74 0.96
SynPerChatbotConvSAFriendship 0.51 nan 0.53 0.75
SynPerChatbotConvSAHappiness 0.6 nan 0.52 0.92
SynPerChatbotConvSAJealousy 0.59 nan 0.70 0.84
SynPerChatbotConvSALove 0.53 nan 0.46 0.86
SynPerChatbotConvSASadness 0.76 nan 0.64 0.95
SynPerChatbotConvSASatisfaction 0.74 nan 0.61 0.98
SynPerChatbotConvSASurprise 0.55 nan 0.54 0.86
SynPerChatbotConvSAToneChatbotClassification 0.62 nan 0.58 0.99
SynPerChatbotConvSAToneUserClassification 0.58 nan 0.53 0.97
SynPerChatbotRAGFAQPC 0.68 nan 0.63 0.93
SynPerChatbotRAGFAQRetrieval 0.29 nan 0.23 0.54
SynPerChatbotRAGSumSRetrieval 0.41 nan 0.50 0.80
SynPerChatbotRAGToneChatbotClassification 0.36 nan 0.37 0.90
SynPerChatbotRAGToneUserClassification 0.51 nan 0.51 0.93
SynPerChatbotRAGTopicsRetrieval 0.25 nan 0.19 0.50
SynPerChatbotSatisfactionLevelClassification 0.29 nan 0.25 0.60
SynPerChatbotSumSRetrieval 0.19 nan 0.28 0.70
SynPerChatbotToneChatbotClassification 0.43 nan 0.41 0.95
SynPerChatbotToneUserClassification 0.47 nan 0.47 0.95
SynPerChatbotTopicsRetrieval 0.19 nan 0.12 0.52
SynPerQAPC 0.92 nan 0.95 0.98
SynPerQARetrieval 0.8 nan 0.87 0.90
SynPerSTS 0.86 nan 0.88 0.90
SynPerTextKeywordsPC 0.97 nan 0.95 0.99
SynPerTextToneClassification 0.61 nan 0.70 0.91
TRECCOVID-Fa 0.59 nan 0.72 0.77
Touche2020-Fa 0.22 nan 0.26 0.26
WikipediaRerankingMultilingual 0.85 0.92 0.89 0.92
WikipediaRetrievalMultilingual 0.86 0.94 0.90 0.94
Average 0.51 0.89 0.54 0.70

@KennethEnevoldsen
Contributor

@mehran-sarmadi many of these scores are way above the e5 baseline. Do you have any idea why this might be? Especially for the SynPerChatbot* tasks.
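
For readers who want to reproduce this kind of check, here is a minimal sketch that flags tasks where a new model beats the baseline by a wide margin. The rows are copied from the MCINext/Hakim table above; the `MARGIN` threshold and the `flag_large_gains` helper are illustrative assumptions, not part of the comparison bot.

```python
# (task, MCINext/Hakim score, intfloat/multilingual-e5-large score),
# copied from the "Results for MCINext/Hakim" table.
rows = [
    ("SynPerChatbotConvSAHappiness", 0.92, 0.52),
    ("SynPerChatbotRAGToneChatbotClassification", 0.90, 0.37),
    ("SynPerChatbotTopicsRetrieval", 0.52, 0.12),
    ("SynPerSTS", 0.88, 0.88),
    ("CQADupstackAndroidRetrieval-Fa", 0.15, 0.42),
]

MARGIN = 0.25  # hypothetical threshold for "suspiciously large" gains


def flag_large_gains(rows, margin):
    """Return (task, gain) pairs where the new model exceeds the baseline by more than margin."""
    return [
        (task, round(new - base, 2))
        for task, new, base in rows
        if new - base > margin
    ]


flagged = flag_large_gains(rows, MARGIN)
for task, gain in flagged:
    print(task, gain)
```

On these five rows, only the three SynPerChatbot* tasks are flagged, which matches the pattern raised in the comment.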

@mehran-sarmadi
Contributor Author

mehran-sarmadi commented Jul 5, 2025

Hi @KennethEnevoldsen

The main reason is that the Hakim and Hakim-small models were fine-tuned on the training splits of the SynPerChatbot datasets during Stage 3 (supervised fine-tuning). As a result, their "zero-shot" scores on the leaderboard are much lower than those of models like E5: Hakim and Hakim-small score around 33%, whereas E5 scores approximately 91%. In contrast, the Hakim-unsup model did not undergo Stage 3 fine-tuning, and its zero-shot score is around 90%, so its results are much closer to the E5 baseline. It is only slightly better, likely due to its exposure to a large volume of high-quality Persian data.
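
To make the "zero-shot" figure concrete, here is a rough sketch of how such a percentage could be computed. It assumes the score is the share of benchmark tasks that do not appear in the model's declared training data; the task lists below are a toy example, not the actual Hakim training set, and `zero_shot_share` is a hypothetical helper rather than the leaderboard's code.

```python
def zero_shot_share(benchmark_tasks, training_datasets):
    """Percentage of benchmark tasks with no declared training exposure (assumed definition)."""
    unseen = [task for task in benchmark_tasks if task not in training_datasets]
    return 100.0 * len(unseen) / len(benchmark_tasks)


# Toy benchmark of four tasks, two of which the model is declared to have trained on.
benchmark_tasks = [
    "FarsTail",
    "SynPerSTS",
    "SynPerChatbotTopicsRetrieval",
    "SynPerChatbotRAGFAQRetrieval",
]
declared_training = {
    "SynPerChatbotTopicsRetrieval": ["train"],
    "SynPerChatbotRAGFAQRetrieval": ["train"],
}

share = zero_shot_share(benchmark_tasks, declared_training)
print(f"{share:.0f}% zero-shot")
```

Under this reading, a model fine-tuned on many of the benchmark's training splits gets a low zero-shot percentage even when its task scores are high.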

@KennethEnevoldsen
Contributor

Thanks for the clarification @mehran-sarmadi, we will have to add this to the model card. You can add it like so:

# in model meta
...
training_datasets={"SynPerChatbotRAGTopicsRetrieval": ["train"],
...} # add all datasets you have trained on (even if only the training split was used)
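
When many tasks share a prefix, the mapping can be built programmatically rather than typed by hand. The sketch below is illustrative only: it assumes every listed task with the `SynPerChatbot` prefix was trained on via its `train` split, which the thread does not confirm task by task, and `evaluated_tasks` is a shortened stand-in for the full task list.

```python
# Shortened stand-in for the evaluated task list (names taken from the comparison above).
evaluated_tasks = [
    "FarsTail",
    "SynPerChatbotRAGTopicsRetrieval",
    "SynPerChatbotTopicsRetrieval",
    "SynPerSTS",
]

# Assumption: all tasks with this prefix were seen during supervised fine-tuning.
TRAINED_PREFIX = "SynPerChatbot"

# Same shape as the snippet above: task name -> list of splits trained on.
training_datasets = {
    task: ["train"]
    for task in evaluated_tasks
    if task.startswith(TRAINED_PREFIX)
}

for task, splits in sorted(training_datasets.items()):
    print(task, splits)
```

The resulting dictionary can then be reviewed by hand before it is pasted into the model meta, since a prefix match may include tasks that were not actually used for training.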

@mehran-sarmadi
Contributor Author

Hi @KennethEnevoldsen
Already done! They are included in the model meta.

@KennethEnevoldsen
Contributor

Ah, sorry I overlooked that - then everything is good here!

@KennethEnevoldsen KennethEnevoldsen merged commit fc4e701 into embeddings-benchmark:main Jul 8, 2025
3 checks passed