add Results of Hakims and Tooka-SBERTV2s #221

KennethEnevoldsen merged 2 commits into embeddings-benchmark:main from
Conversation
I think you need to update your PR first, because restarting the action would only rerun the old version. But for now, I think you need to finish the integration of your model.
@Samoed, it seems like the CI fails here. Any idea why?
I think this action ran before your latest modifications; to update it, main should be merged into this branch.
Model Results Comparison

Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large

Results for MCINext/Hakim-small
| task_name | MCINext/Hakim-small | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-Fa | 0.40 | nan | 0.45 | 0.62 |
| BeytooteClustering | 0.63 | nan | 0.62 | 0.68 |
| CExaPPC | 0.98 | nan | 0.99 | 0.99 |
| CQADupstackAndroidRetrieval-Fa | 0.15 | nan | 0.42 | 0.47 |
| CQADupstackEnglishRetrieval-Fa | 0.12 | nan | 0.30 | 0.35 |
| CQADupstackGamingRetrieval-Fa | 0.18 | nan | 0.46 | 0.49 |
| CQADupstackGisRetrieval-Fa | 0.11 | nan | 0.30 | 0.35 |
| CQADupstackMathematicaRetrieval-Fa | 0.10 | nan | 0.19 | 0.26 |
| CQADupstackPhysicsRetrieval-Fa | 0.17 | nan | 0.37 | 0.41 |
| CQADupstackProgrammersRetrieval-Fa | 0.15 | nan | 0.35 | 0.38 |
| CQADupstackRetrieval-Fa | 0.12 | nan | 0.32 | 0.36 |
| CQADupstackStatsRetrieval-Fa | 0.12 | nan | 0.28 | 0.30 |
| CQADupstackTexRetrieval-Fa | 0.07 | nan | 0.20 | 0.25 |
| CQADupstackUnixRetrieval-Fa | 0.12 | nan | 0.32 | 0.37 |
| CQADupstackWebmastersRetrieval-Fa | 0.13 | nan | 0.34 | 0.38 |
| CQADupstackWordpressRetrieval-Fa | 0.08 | nan | 0.26 | 0.30 |
| ClimateFEVER-Fa | 0.16 | nan | 0.13 | 0.30 |
| DBPedia-Fa | 0.21 | nan | 0.30 | 0.37 |
| DeepSentiPers | 0.67 | nan | 0.61 | 0.73 |
| DigikalamagClassification | 0.90 | nan | 0.87 | 0.91 |
| DigikalamagClustering | 0.67 | nan | 0.40 | 0.79 |
| FarsTail | 0.71 | nan | 0.73 | 0.82 |
| FarsiParaphraseDetection | 0.99 | nan | 0.98 | 1.00 |
| Farsick | 0.68 | nan | 0.71 | 0.77 |
| FiQA2018-Fa | 0.19 | nan | 0.30 | 0.37 |
| HamshahriClustring | 0.68 | nan | 0.67 | 0.76 |
| HotpotQA-Fa | 0.42 | nan | 0.60 | 0.61 |
| MIRACLReranking | 0.48 | nan | 0.65 | 0.66 |
| MIRACLRetrieval | 0.41 | nan | 0.59 | 0.72 |
| MSMARCO-Fa | 0.19 | nan | 0.31 | 0.31 |
| MassiveIntentClassification | 0.64 | 0.82 | 0.60 | 0.92 |
| MassiveScenarioClassification | 0.78 | 0.87 | 0.70 | 0.99 |
| NFCorpus-Fa | 0.23 | nan | 0.29 | 0.31 |
| NLPTwitterAnalysisClassification | 0.78 | nan | 0.76 | 0.79 |
| NLPTwitterAnalysisClustering | 0.84 | nan | 0.78 | 0.86 |
| NQ-Fa | 0.24 | nan | 0.45 | 0.50 |
| ParsinluEntail | 0.66 | nan | 0.65 | 0.78 |
| ParsinluQueryParaphPC | 0.85 | nan | 0.88 | 0.90 |
| PersianFoodSentimentClassification | 0.84 | nan | 0.82 | 0.87 |
| PersianTextEmotion | 0.87 | nan | 0.62 | 0.92 |
| PersianWebDocumentRetrieval | 0.35 | nan | 0.47 | 0.57 |
| Query2Query | 0.76 | nan | 0.67 | 0.82 |
| QuoraRetrieval-Fa | 0.74 | nan | 0.80 | 0.82 |
| SAMSumFa | 0.98 | nan | 0.92 | 0.99 |
| SCIDOCS-Fa | 0.10 | nan | 0.12 | 0.18 |
| SIDClassification | 0.65 | nan | 0.61 | 0.68 |
| SIDClustring | 0.49 | nan | 0.39 | 0.55 |
| SciFact-Fa | 0.52 | nan | 0.60 | 0.74 |
| SentimentDKSF | 0.80 | nan | 0.71 | 0.83 |
| SynPerChatbotConvSAAnger | 0.96 | nan | 0.72 | 0.97 |
| SynPerChatbotConvSAClassification | 0.87 | nan | 0.61 | 0.90 |
| SynPerChatbotConvSAFear | 0.90 | nan | 0.74 | 0.96 |
| SynPerChatbotConvSAFriendship | 0.71 | nan | 0.53 | 0.75 |
| SynPerChatbotConvSAHappiness | 0.91 | nan | 0.52 | 0.92 |
| SynPerChatbotConvSAJealousy | 0.76 | nan | 0.70 | 0.84 |
| SynPerChatbotConvSALove | 0.86 | nan | 0.46 | 0.86 |
| SynPerChatbotConvSASadness | 0.93 | nan | 0.64 | 0.95 |
| SynPerChatbotConvSASatisfaction | 0.97 | nan | 0.61 | 0.98 |
| SynPerChatbotConvSASurprise | 0.79 | nan | 0.54 | 0.86 |
| SynPerChatbotConvSAToneChatbotClassification | 0.98 | nan | 0.58 | 0.99 |
| SynPerChatbotConvSAToneUserClassification | 0.91 | nan | 0.53 | 0.97 |
| SynPerChatbotRAGFAQPC | 0.84 | nan | 0.63 | 0.93 |
| SynPerChatbotRAGFAQRetrieval | 0.49 | nan | 0.23 | 0.54 |
| SynPerChatbotRAGSumSRetrieval | 0.75 | nan | 0.50 | 0.80 |
| SynPerChatbotRAGToneChatbotClassification | 0.87 | nan | 0.37 | 0.90 |
| SynPerChatbotRAGToneUserClassification | 0.85 | nan | 0.51 | 0.93 |
| SynPerChatbotRAGTopicsRetrieval | 0.44 | nan | 0.19 | 0.50 |
| SynPerChatbotSatisfactionLevelClassification | 0.54 | nan | 0.25 | 0.60 |
| SynPerChatbotSumSRetrieval | 0.63 | nan | 0.28 | 0.70 |
| SynPerChatbotToneChatbotClassification | 0.92 | nan | 0.41 | 0.95 |
| SynPerChatbotToneUserClassification | 0.90 | nan | 0.47 | 0.95 |
| SynPerChatbotTopicsRetrieval | 0.46 | nan | 0.12 | 0.52 |
| SynPerQAPC | 0.97 | nan | 0.95 | 0.98 |
| SynPerQARetrieval | 0.88 | nan | 0.87 | 0.90 |
| SynPerSTS | 0.86 | nan | 0.88 | 0.90 |
| SynPerTextKeywordsPC | 0.98 | nan | 0.95 | 0.99 |
| SynPerTextToneClassification | 0.85 | nan | 0.70 | 0.91 |
| TRECCOVID-Fa | 0.42 | nan | 0.72 | 0.77 |
| Touche2020-Fa | 0.16 | nan | 0.26 | 0.26 |
| WikipediaRerankingMultilingual | 0.87 | 0.92 | 0.89 | 0.92 |
| WikipediaRetrievalMultilingual | 0.85 | 0.94 | 0.90 | 0.94 |
| Average | 0.59 | 0.89 | 0.54 | 0.70 |
Results for MCINext/Hakim-unsup
| task_name | MCINext/Hakim-unsup | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-Fa | 0.37 | nan | 0.45 | 0.62 |
| BeytooteClustering | 0.60 | nan | 0.62 | 0.68 |
| CExaPPC | 0.96 | nan | 0.99 | 0.99 |
| CQADupstackAndroidRetrieval-Fa | 0.38 | nan | 0.42 | 0.47 |
| CQADupstackEnglishRetrieval-Fa | 0.30 | nan | 0.30 | 0.35 |
| CQADupstackGamingRetrieval-Fa | 0.41 | nan | 0.46 | 0.49 |
| CQADupstackGisRetrieval-Fa | 0.24 | nan | 0.30 | 0.35 |
| CQADupstackMathematicaRetrieval-Fa | 0.18 | nan | 0.19 | 0.26 |
| CQADupstackPhysicsRetrieval-Fa | 0.35 | nan | 0.37 | 0.41 |
| CQADupstackProgrammersRetrieval-Fa | 0.27 | nan | 0.35 | 0.38 |
| CQADupstackRetrieval-Fa | 0.27 | nan | 0.32 | 0.36 |
| CQADupstackStatsRetrieval-Fa | 0.23 | nan | 0.28 | 0.30 |
| CQADupstackTexRetrieval-Fa | 0.16 | nan | 0.20 | 0.25 |
| CQADupstackUnixRetrieval-Fa | 0.26 | nan | 0.32 | 0.37 |
| CQADupstackWebmastersRetrieval-Fa | 0.30 | nan | 0.34 | 0.38 |
| CQADupstackWordpressRetrieval-Fa | 0.21 | nan | 0.26 | 0.30 |
| ClimateFEVER-Fa | 0.16 | nan | 0.13 | 0.30 |
| DBPedia-Fa | 0.26 | nan | 0.30 | 0.37 |
| DeepSentiPers | 0.66 | nan | 0.61 | 0.73 |
| DigikalamagClassification | 0.85 | nan | 0.87 | 0.91 |
| DigikalamagClustering | 0.44 | nan | 0.40 | 0.79 |
| FarsTail | 0.79 | nan | 0.73 | 0.82 |
| FarsiParaphraseDetection | 0.95 | nan | 0.98 | 1.00 |
| Farsick | 0.72 | nan | 0.71 | 0.77 |
| FiQA2018-Fa | 0.18 | nan | 0.30 | 0.37 |
| HamshahriClustring | 0.69 | nan | 0.67 | 0.76 |
| HotpotQA-Fa | 0.39 | nan | 0.60 | 0.61 |
| MIRACLReranking | 0.53 | nan | 0.65 | 0.66 |
| MIRACLRetrieval | 0.50 | nan | 0.59 | 0.72 |
| MSMARCO-Fa | 0.22 | nan | 0.31 | 0.31 |
| MassiveIntentClassification | 0.67 | 0.82 | 0.60 | 0.92 |
| MassiveScenarioClassification | 0.71 | 0.87 | 0.70 | 0.99 |
| NFCorpus-Fa | 0.29 | nan | 0.29 | 0.31 |
| NLPTwitterAnalysisClassification | 0.76 | nan | 0.76 | 0.79 |
| NLPTwitterAnalysisClustering | 0.81 | nan | 0.78 | 0.86 |
| NQ-Fa | 0.30 | nan | 0.45 | 0.50 |
| ParsinluEntail | 0.75 | nan | 0.65 | 0.78 |
| ParsinluQueryParaphPC | 0.89 | nan | 0.88 | 0.90 |
| PersianFoodSentimentClassification | 0.77 | nan | 0.82 | 0.87 |
| PersianTextEmotion | 0.65 | nan | 0.62 | 0.92 |
| PersianWebDocumentRetrieval | 0.57 | nan | 0.47 | 0.57 |
| Query2Query | 0.82 | nan | 0.67 | 0.82 |
| QuoraRetrieval-Fa | 0.78 | nan | 0.80 | 0.82 |
| SAMSumFa | 0.92 | nan | 0.92 | 0.99 |
| SCIDOCS-Fa | 0.13 | nan | 0.12 | 0.18 |
| SIDClassification | 0.60 | nan | 0.61 | 0.68 |
| SIDClustring | 0.40 | nan | 0.39 | 0.55 |
| SciFact-Fa | 0.48 | nan | 0.60 | 0.74 |
| SentimentDKSF | 0.71 | nan | 0.71 | 0.83 |
| SynPerChatbotConvSAAnger | 0.82 | nan | 0.72 | 0.97 |
| SynPerChatbotConvSAClassification | 0.69 | nan | 0.61 | 0.90 |
| SynPerChatbotConvSAFear | 0.80 | nan | 0.74 | 0.96 |
| SynPerChatbotConvSAFriendship | 0.57 | nan | 0.53 | 0.75 |
| SynPerChatbotConvSAHappiness | 0.61 | nan | 0.52 | 0.92 |
| SynPerChatbotConvSAJealousy | 0.72 | nan | 0.70 | 0.84 |
| SynPerChatbotConvSALove | 0.59 | nan | 0.46 | 0.86 |
| SynPerChatbotConvSASadness | 0.79 | nan | 0.64 | 0.95 |
| SynPerChatbotConvSASatisfaction | 0.74 | nan | 0.61 | 0.98 |
| SynPerChatbotConvSASurprise | 0.62 | nan | 0.54 | 0.86 |
| SynPerChatbotConvSAToneChatbotClassification | 0.59 | nan | 0.58 | 0.99 |
| SynPerChatbotConvSAToneUserClassification | 0.56 | nan | 0.53 | 0.97 |
| SynPerChatbotRAGFAQPC | 0.68 | nan | 0.63 | 0.93 |
| SynPerChatbotRAGFAQRetrieval | 0.32 | nan | 0.23 | 0.54 |
| SynPerChatbotRAGSumSRetrieval | 0.57 | nan | 0.50 | 0.80 |
| SynPerChatbotRAGToneChatbotClassification | 0.34 | nan | 0.37 | 0.90 |
| SynPerChatbotRAGToneUserClassification | 0.53 | nan | 0.51 | 0.93 |
| SynPerChatbotRAGTopicsRetrieval | 0.22 | nan | 0.19 | 0.50 |
| SynPerChatbotSatisfactionLevelClassification | 0.29 | nan | 0.25 | 0.60 |
| SynPerChatbotSumSRetrieval | 0.35 | nan | 0.28 | 0.70 |
| SynPerChatbotToneChatbotClassification | 0.42 | nan | 0.41 | 0.95 |
| SynPerChatbotToneUserClassification | 0.49 | nan | 0.47 | 0.95 |
| SynPerChatbotTopicsRetrieval | 0.14 | nan | 0.12 | 0.52 |
| SynPerQAPC | 0.92 | nan | 0.95 | 0.98 |
| SynPerQARetrieval | 0.81 | nan | 0.87 | 0.90 |
| SynPerSTS | 0.84 | nan | 0.88 | 0.90 |
| SynPerTextKeywordsPC | 0.96 | nan | 0.95 | 0.99 |
| SynPerTextToneClassification | 0.64 | nan | 0.70 | 0.91 |
| TRECCOVID-Fa | 0.52 | nan | 0.72 | 0.77 |
| Touche2020-Fa | 0.16 | nan | 0.26 | 0.26 |
| WikipediaRerankingMultilingual | 0.83 | 0.92 | 0.89 | 0.92 |
| WikipediaRetrievalMultilingual | 0.84 | 0.94 | 0.90 | 0.94 |
| Average | 0.54 | 0.89 | 0.54 | 0.70 |
Results for MCINext/Hakim
| task_name | MCINext/Hakim | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-Fa | 0.44 | nan | 0.45 | 0.62 |
| BeytooteClustering | 0.65 | nan | 0.62 | 0.68 |
| CExaPPC | 0.99 | nan | 0.99 | 0.99 |
| CQADupstackAndroidRetrieval-Fa | 0.15 | nan | 0.42 | 0.47 |
| CQADupstackEnglishRetrieval-Fa | 0.15 | nan | 0.30 | 0.35 |
| CQADupstackGamingRetrieval-Fa | 0.19 | nan | 0.46 | 0.49 |
| CQADupstackGisRetrieval-Fa | 0.12 | nan | 0.30 | 0.35 |
| CQADupstackMathematicaRetrieval-Fa | 0.11 | nan | 0.19 | 0.26 |
| CQADupstackPhysicsRetrieval-Fa | 0.18 | nan | 0.37 | 0.41 |
| CQADupstackProgrammersRetrieval-Fa | 0.18 | nan | 0.35 | 0.38 |
| CQADupstackRetrieval-Fa | 0.14 | nan | 0.32 | 0.36 |
| CQADupstackStatsRetrieval-Fa | 0.15 | nan | 0.28 | 0.30 |
| CQADupstackTexRetrieval-Fa | 0.09 | nan | 0.20 | 0.25 |
| CQADupstackUnixRetrieval-Fa | 0.14 | nan | 0.32 | 0.37 |
| CQADupstackWebmastersRetrieval-Fa | 0.15 | nan | 0.34 | 0.38 |
| CQADupstackWordpressRetrieval-Fa | 0.11 | nan | 0.26 | 0.30 |
| ClimateFEVER-Fa | 0.16 | nan | 0.13 | 0.30 |
| DBPedia-Fa | 0.21 | nan | 0.30 | 0.37 |
| DeepSentiPers | 0.73 | nan | 0.61 | 0.73 |
| DigikalamagClassification | 0.91 | nan | 0.87 | 0.91 |
| DigikalamagClustering | 0.79 | nan | 0.40 | 0.79 |
| FarsTail | 0.74 | nan | 0.73 | 0.82 |
| FarsiParaphraseDetection | 1.00 | nan | 0.98 | 1.00 |
| Farsick | 0.72 | nan | 0.71 | 0.77 |
| FiQA2018-Fa | 0.25 | nan | 0.30 | 0.37 |
| HamshahriClustring | 0.69 | nan | 0.67 | 0.76 |
| HotpotQA-Fa | 0.45 | nan | 0.60 | 0.61 |
| MIRACLReranking | 0.50 | nan | 0.65 | 0.66 |
| MIRACLRetrieval | 0.43 | nan | 0.59 | 0.72 |
| MSMARCO-Fa | 0.22 | nan | 0.31 | 0.31 |
| MassiveIntentClassification | 0.69 | 0.82 | 0.60 | 0.92 |
| MassiveScenarioClassification | 0.85 | 0.87 | 0.70 | 0.99 |
| NFCorpus-Fa | 0.21 | nan | 0.29 | 0.31 |
| NLPTwitterAnalysisClassification | 0.79 | nan | 0.76 | 0.79 |
| NLPTwitterAnalysisClustering | 0.86 | nan | 0.78 | 0.86 |
| NQ-Fa | 0.29 | nan | 0.45 | 0.50 |
| ParsinluEntail | 0.68 | nan | 0.65 | 0.78 |
| ParsinluQueryParaphPC | 0.87 | nan | 0.88 | 0.90 |
| PersianFoodSentimentClassification | 0.87 | nan | 0.82 | 0.87 |
| PersianTextEmotion | 0.92 | nan | 0.62 | 0.92 |
| PersianWebDocumentRetrieval | 0.35 | nan | 0.47 | 0.57 |
| Query2Query | 0.76 | nan | 0.67 | 0.82 |
| QuoraRetrieval-Fa | 0.64 | nan | 0.80 | 0.82 |
| SAMSumFa | 0.99 | nan | 0.92 | 0.99 |
| SCIDOCS-Fa | 0.11 | nan | 0.12 | 0.18 |
| SIDClassification | 0.68 | nan | 0.61 | 0.68 |
| SIDClustring | 0.55 | nan | 0.39 | 0.55 |
| SciFact-Fa | 0.54 | nan | 0.60 | 0.74 |
| SentimentDKSF | 0.83 | nan | 0.71 | 0.83 |
| SynPerChatbotConvSAAnger | 0.97 | nan | 0.72 | 0.97 |
| SynPerChatbotConvSAClassification | 0.90 | nan | 0.61 | 0.90 |
| SynPerChatbotConvSAFear | 0.96 | nan | 0.74 | 0.96 |
| SynPerChatbotConvSAFriendship | 0.75 | nan | 0.53 | 0.75 |
| SynPerChatbotConvSAHappiness | 0.92 | nan | 0.52 | 0.92 |
| SynPerChatbotConvSAJealousy | 0.82 | nan | 0.70 | 0.84 |
| SynPerChatbotConvSALove | 0.86 | nan | 0.46 | 0.86 |
| SynPerChatbotConvSASadness | 0.95 | nan | 0.64 | 0.95 |
| SynPerChatbotConvSASatisfaction | 0.98 | nan | 0.61 | 0.98 |
| SynPerChatbotConvSASurprise | 0.86 | nan | 0.54 | 0.86 |
| SynPerChatbotConvSAToneChatbotClassification | 0.99 | nan | 0.58 | 0.99 |
| SynPerChatbotConvSAToneUserClassification | 0.97 | nan | 0.53 | 0.97 |
| SynPerChatbotRAGFAQPC | 0.93 | nan | 0.63 | 0.93 |
| SynPerChatbotRAGFAQRetrieval | 0.54 | nan | 0.23 | 0.54 |
| SynPerChatbotRAGSumSRetrieval | 0.80 | nan | 0.50 | 0.80 |
| SynPerChatbotRAGToneChatbotClassification | 0.90 | nan | 0.37 | 0.90 |
| SynPerChatbotRAGToneUserClassification | 0.93 | nan | 0.51 | 0.93 |
| SynPerChatbotRAGTopicsRetrieval | 0.50 | nan | 0.19 | 0.50 |
| SynPerChatbotSatisfactionLevelClassification | 0.60 | nan | 0.25 | 0.60 |
| SynPerChatbotSumSRetrieval | 0.70 | nan | 0.28 | 0.70 |
| SynPerChatbotToneChatbotClassification | 0.95 | nan | 0.41 | 0.95 |
| SynPerChatbotToneUserClassification | 0.95 | nan | 0.47 | 0.95 |
| SynPerChatbotTopicsRetrieval | 0.52 | nan | 0.12 | 0.52 |
| SynPerQAPC | 0.98 | nan | 0.95 | 0.98 |
| SynPerQARetrieval | 0.90 | nan | 0.87 | 0.90 |
| SynPerSTS | 0.88 | nan | 0.88 | 0.90 |
| SynPerTextKeywordsPC | 0.99 | nan | 0.95 | 0.99 |
| SynPerTextToneClassification | 0.90 | nan | 0.70 | 0.91 |
| TRECCOVID-Fa | 0.53 | nan | 0.72 | 0.77 |
| Touche2020-Fa | 0.17 | nan | 0.26 | 0.26 |
| WikipediaRerankingMultilingual | 0.89 | 0.92 | 0.89 | 0.92 |
| WikipediaRetrievalMultilingual | 0.88 | 0.94 | 0.90 | 0.94 |
| Average | 0.62 | 0.89 | 0.54 | 0.70 |
Results for PartAI/Tooka-SBERT-V2-Large
| task_name | PartAI/Tooka-SBERT-V2-Large | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-Fa | 0.42 | nan | 0.45 | 0.62 |
| BeytooteClustering | 0.57 | nan | 0.62 | 0.68 |
| CExaPPC | 0.99 | nan | 0.99 | 0.99 |
| CQADupstackAndroidRetrieval-Fa | 0.37 | nan | 0.42 | 0.47 |
| CQADupstackEnglishRetrieval-Fa | 0.25 | nan | 0.30 | 0.35 |
| CQADupstackGamingRetrieval-Fa | 0.38 | nan | 0.46 | 0.49 |
| CQADupstackGisRetrieval-Fa | 0.22 | nan | 0.30 | 0.35 |
| CQADupstackMathematicaRetrieval-Fa | 0.14 | nan | 0.19 | 0.26 |
| CQADupstackPhysicsRetrieval-Fa | 0.33 | nan | 0.37 | 0.41 |
| CQADupstackProgrammersRetrieval-Fa | 0.28 | nan | 0.35 | 0.38 |
| CQADupstackRetrieval-Fa | 0.26 | nan | 0.32 | 0.36 |
| CQADupstackStatsRetrieval-Fa | 0.21 | nan | 0.28 | 0.30 |
| CQADupstackTexRetrieval-Fa | 0.15 | nan | 0.20 | 0.25 |
| CQADupstackUnixRetrieval-Fa | 0.24 | nan | 0.32 | 0.37 |
| CQADupstackWebmastersRetrieval-Fa | 0.30 | nan | 0.34 | 0.38 |
| CQADupstackWordpressRetrieval-Fa | 0.20 | nan | 0.26 | 0.30 |
| ClimateFEVER-Fa | 0.12 | nan | 0.13 | 0.30 |
| DBPedia-Fa | 0.20 | nan | 0.30 | 0.37 |
| DeepSentiPers | 0.66 | nan | 0.61 | 0.73 |
| DigikalamagClassification | 0.78 | nan | 0.87 | 0.91 |
| DigikalamagClustering | 0.49 | nan | 0.40 | 0.79 |
| FarsTail | 0.80 | nan | 0.73 | 0.82 |
| FarsiParaphraseDetection | 0.95 | nan | 0.98 | 1.00 |
| Farsick | 0.66 | nan | 0.71 | 0.77 |
| FiQA2018-Fa | 0.19 | nan | 0.30 | 0.37 |
| HamshahriClustring | 0.66 | nan | 0.67 | 0.76 |
| HotpotQA-Fa | 0.28 | nan | 0.60 | 0.61 |
| MIRACLReranking | 0.53 | nan | 0.65 | 0.66 |
| MIRACLRetrieval | 0.45 | nan | 0.59 | 0.72 |
| MSMARCO-Fa | 0.17 | nan | 0.31 | 0.31 |
| MassiveIntentClassification | 0.68 | 0.82 | 0.60 | 0.92 |
| MassiveScenarioClassification | 0.72 | 0.87 | 0.70 | 0.99 |
| NFCorpus-Fa | 0.25 | nan | 0.29 | 0.31 |
| NLPTwitterAnalysisClassification | 0.77 | nan | 0.76 | 0.79 |
| NLPTwitterAnalysisClustering | 0.81 | nan | 0.78 | 0.86 |
| NQ-Fa | 0.26 | nan | 0.45 | 0.50 |
| ParsinluEntail | 0.78 | nan | 0.65 | 0.78 |
| ParsinluQueryParaphPC | 0.88 | nan | 0.88 | 0.90 |
| PersianFoodSentimentClassification | 0.79 | nan | 0.82 | 0.87 |
| PersianTextEmotion | 0.57 | nan | 0.62 | 0.92 |
| PersianWebDocumentRetrieval | 0.38 | nan | 0.47 | 0.57 |
| Query2Query | 0.71 | nan | 0.67 | 0.82 |
| QuoraRetrieval-Fa | 0.79 | nan | 0.80 | 0.82 |
| SAMSumFa | 0.87 | nan | 0.92 | 0.99 |
| SCIDOCS-Fa | 0.11 | nan | 0.12 | 0.18 |
| SIDClassification | 0.55 | nan | 0.61 | 0.68 |
| SIDClustring | 0.44 | nan | 0.39 | 0.55 |
| SciFact-Fa | 0.43 | nan | 0.60 | 0.74 |
| SentimentDKSF | 0.72 | nan | 0.71 | 0.83 |
| SynPerChatbotConvSAAnger | 0.86 | nan | 0.72 | 0.97 |
| SynPerChatbotConvSAClassification | 0.73 | nan | 0.61 | 0.90 |
| SynPerChatbotConvSAFear | 0.83 | nan | 0.74 | 0.96 |
| SynPerChatbotConvSAFriendship | 0.53 | nan | 0.53 | 0.75 |
| SynPerChatbotConvSAHappiness | 0.65 | nan | 0.52 | 0.92 |
| SynPerChatbotConvSAJealousy | 0.73 | nan | 0.70 | 0.84 |
| SynPerChatbotConvSALove | 0.67 | nan | 0.46 | 0.86 |
| SynPerChatbotConvSASadness | 0.82 | nan | 0.64 | 0.95 |
| SynPerChatbotConvSASatisfaction | 0.86 | nan | 0.61 | 0.98 |
| SynPerChatbotConvSASurprise | 0.64 | nan | 0.54 | 0.86 |
| SynPerChatbotConvSAToneChatbotClassification | 0.65 | nan | 0.58 | 0.99 |
| SynPerChatbotConvSAToneUserClassification | 0.59 | nan | 0.53 | 0.97 |
| SynPerChatbotRAGFAQPC | 0.67 | nan | 0.63 | 0.93 |
| SynPerChatbotRAGFAQRetrieval | 0.27 | nan | 0.23 | 0.54 |
| SynPerChatbotRAGSumSRetrieval | 0.48 | nan | 0.50 | 0.80 |
| SynPerChatbotRAGToneChatbotClassification | 0.38 | nan | 0.37 | 0.90 |
| SynPerChatbotRAGToneUserClassification | 0.53 | nan | 0.51 | 0.93 |
| SynPerChatbotRAGTopicsRetrieval | 0.20 | nan | 0.19 | 0.50 |
| SynPerChatbotSatisfactionLevelClassification | 0.35 | nan | 0.25 | 0.60 |
| SynPerChatbotSumSRetrieval | 0.24 | nan | 0.28 | 0.70 |
| SynPerChatbotToneChatbotClassification | 0.48 | nan | 0.41 | 0.95 |
| SynPerChatbotToneUserClassification | 0.49 | nan | 0.47 | 0.95 |
| SynPerChatbotTopicsRetrieval | 0.13 | nan | 0.12 | 0.52 |
| SynPerQAPC | 0.93 | nan | 0.95 | 0.98 |
| SynPerQARetrieval | 0.82 | nan | 0.87 | 0.90 |
| SynPerSTS | 0.88 | nan | 0.88 | 0.90 |
| SynPerTextKeywordsPC | 0.95 | nan | 0.95 | 0.99 |
| SynPerTextToneClassification | 0.72 | nan | 0.70 | 0.91 |
| TRECCOVID-Fa | 0.51 | nan | 0.72 | 0.77 |
| Touche2020-Fa | 0.18 | nan | 0.26 | 0.26 |
| WikipediaRerankingMultilingual | 0.86 | 0.92 | 0.89 | 0.92 |
| WikipediaRetrievalMultilingual | 0.87 | 0.94 | 0.90 | 0.94 |
| Average | 0.53 | 0.89 | 0.54 | 0.70 |
Results for PartAI/Tooka-SBERT-V2-Small
| task_name | PartAI/Tooka-SBERT-V2-Small | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| ArguAna-Fa | 0.41 | nan | 0.45 | 0.62 |
| BeytooteClustering | 0.61 | nan | 0.62 | 0.68 |
| CExaPPC | 0.95 | nan | 0.99 | 0.99 |
| CQADupstackAndroidRetrieval-Fa | 0.33 | nan | 0.42 | 0.47 |
| CQADupstackEnglishRetrieval-Fa | 0.22 | nan | 0.30 | 0.35 |
| CQADupstackGamingRetrieval-Fa | 0.35 | nan | 0.46 | 0.49 |
| CQADupstackGisRetrieval-Fa | 0.21 | nan | 0.30 | 0.35 |
| CQADupstackMathematicaRetrieval-Fa | 0.15 | nan | 0.19 | 0.26 |
| CQADupstackPhysicsRetrieval-Fa | 0.31 | nan | 0.37 | 0.41 |
| CQADupstackProgrammersRetrieval-Fa | 0.26 | nan | 0.35 | 0.38 |
| CQADupstackRetrieval-Fa | 0.23 | nan | 0.32 | 0.36 |
| CQADupstackStatsRetrieval-Fa | 0.20 | nan | 0.28 | 0.30 |
| CQADupstackTexRetrieval-Fa | 0.13 | nan | 0.20 | 0.25 |
| CQADupstackUnixRetrieval-Fa | 0.22 | nan | 0.32 | 0.37 |
| CQADupstackWebmastersRetrieval-Fa | 0.27 | nan | 0.34 | 0.38 |
| CQADupstackWordpressRetrieval-Fa | 0.17 | nan | 0.26 | 0.30 |
| ClimateFEVER-Fa | 0.15 | nan | 0.13 | 0.30 |
| DBPedia-Fa | 0.23 | nan | 0.30 | 0.37 |
| DeepSentiPers | 0.62 | nan | 0.61 | 0.73 |
| DigikalamagClassification | 0.77 | nan | 0.87 | 0.91 |
| DigikalamagClustering | 0.48 | nan | 0.40 | 0.79 |
| FarsTail | 0.76 | nan | 0.73 | 0.82 |
| FarsiParaphraseDetection | 0.95 | nan | 0.98 | 1.00 |
| Farsick | 0.64 | nan | 0.71 | 0.77 |
| FiQA2018-Fa | 0.16 | nan | 0.30 | 0.37 |
| HamshahriClustring | 0.65 | nan | 0.67 | 0.76 |
| HotpotQA-Fa | 0.29 | nan | 0.60 | 0.61 |
| MIRACLReranking | 0.55 | nan | 0.65 | 0.66 |
| MIRACLRetrieval | 0.51 | nan | 0.59 | 0.72 |
| MSMARCO-Fa | 0.17 | nan | 0.31 | 0.31 |
| MassiveIntentClassification | 0.66 | 0.82 | 0.60 | 0.92 |
| MassiveScenarioClassification | 0.69 | 0.87 | 0.70 | 0.99 |
| NFCorpus-Fa | 0.25 | nan | 0.29 | 0.31 |
| NLPTwitterAnalysisClassification | 0.77 | nan | 0.76 | 0.79 |
| NLPTwitterAnalysisClustering | 0.79 | nan | 0.78 | 0.86 |
| NQ-Fa | 0.25 | nan | 0.45 | 0.50 |
| ParsinluEntail | 0.73 | nan | 0.65 | 0.78 |
| ParsinluQueryParaphPC | 0.87 | nan | 0.88 | 0.90 |
| PersianFoodSentimentClassification | 0.77 | nan | 0.82 | 0.87 |
| PersianTextEmotion | 0.53 | nan | 0.62 | 0.92 |
| PersianWebDocumentRetrieval | 0.45 | nan | 0.47 | 0.57 |
| Query2Query | 0.70 | nan | 0.67 | 0.82 |
| QuoraRetrieval-Fa | 0.76 | nan | 0.80 | 0.82 |
| SAMSumFa | 0.74 | nan | 0.92 | 0.99 |
| SCIDOCS-Fa | 0.11 | nan | 0.12 | 0.18 |
| SIDClassification | 0.55 | nan | 0.61 | 0.68 |
| SIDClustring | 0.42 | nan | 0.39 | 0.55 |
| SciFact-Fa | 0.43 | nan | 0.60 | 0.74 |
| SentimentDKSF | 0.67 | nan | 0.71 | 0.83 |
| SynPerChatbotConvSAAnger | 0.81 | nan | 0.72 | 0.97 |
| SynPerChatbotConvSAClassification | 0.66 | nan | 0.61 | 0.90 |
| SynPerChatbotConvSAFear | 0.81 | nan | 0.74 | 0.96 |
| SynPerChatbotConvSAFriendship | 0.51 | nan | 0.53 | 0.75 |
| SynPerChatbotConvSAHappiness | 0.60 | nan | 0.52 | 0.92 |
| SynPerChatbotConvSAJealousy | 0.59 | nan | 0.70 | 0.84 |
| SynPerChatbotConvSALove | 0.53 | nan | 0.46 | 0.86 |
| SynPerChatbotConvSASadness | 0.76 | nan | 0.64 | 0.95 |
| SynPerChatbotConvSASatisfaction | 0.74 | nan | 0.61 | 0.98 |
| SynPerChatbotConvSASurprise | 0.55 | nan | 0.54 | 0.86 |
| SynPerChatbotConvSAToneChatbotClassification | 0.62 | nan | 0.58 | 0.99 |
| SynPerChatbotConvSAToneUserClassification | 0.58 | nan | 0.53 | 0.97 |
| SynPerChatbotRAGFAQPC | 0.68 | nan | 0.63 | 0.93 |
| SynPerChatbotRAGFAQRetrieval | 0.29 | nan | 0.23 | 0.54 |
| SynPerChatbotRAGSumSRetrieval | 0.41 | nan | 0.50 | 0.80 |
| SynPerChatbotRAGToneChatbotClassification | 0.36 | nan | 0.37 | 0.90 |
| SynPerChatbotRAGToneUserClassification | 0.51 | nan | 0.51 | 0.93 |
| SynPerChatbotRAGTopicsRetrieval | 0.25 | nan | 0.19 | 0.50 |
| SynPerChatbotSatisfactionLevelClassification | 0.29 | nan | 0.25 | 0.60 |
| SynPerChatbotSumSRetrieval | 0.19 | nan | 0.28 | 0.70 |
| SynPerChatbotToneChatbotClassification | 0.43 | nan | 0.41 | 0.95 |
| SynPerChatbotToneUserClassification | 0.47 | nan | 0.47 | 0.95 |
| SynPerChatbotTopicsRetrieval | 0.19 | nan | 0.12 | 0.52 |
| SynPerQAPC | 0.92 | nan | 0.95 | 0.98 |
| SynPerQARetrieval | 0.80 | nan | 0.87 | 0.90 |
| SynPerSTS | 0.86 | nan | 0.88 | 0.90 |
| SynPerTextKeywordsPC | 0.97 | nan | 0.95 | 0.99 |
| SynPerTextToneClassification | 0.61 | nan | 0.70 | 0.91 |
| TRECCOVID-Fa | 0.59 | nan | 0.72 | 0.77 |
| Touche2020-Fa | 0.22 | nan | 0.26 | 0.26 |
| WikipediaRerankingMultilingual | 0.85 | 0.92 | 0.89 | 0.92 |
| WikipediaRetrievalMultilingual | 0.86 | 0.94 | 0.90 | 0.94 |
| Average | 0.51 | 0.89 | 0.54 | 0.70 |
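The per-model tables above can be compared programmatically. Below is a minimal sketch (assuming the exact markdown layout used above, with columns task, model, gemini, e5, max) that parses rows and counts tasks where the candidate model beats the e5 baseline; the function and table names are illustrative, not part of the benchmark tooling.

```python
def parse_rows(markdown: str) -> dict[str, tuple[float, float]]:
    """Parse '| task | model | gemini | e5 | max |' rows into {task: (model, e5)}."""
    scores = {}
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5:
            continue  # not a table row
        if cells[0] == "task_name" or set(cells[1]) <= {"-"}:
            continue  # header or separator row
        task, model, _gemini, e5, _max = cells
        if model != "nan" and e5 != "nan":
            scores[task] = (float(model), float(e5))
    return scores

# Two example rows copied from the MCINext/Hakim table above
table = """
| task_name | model | gemini | e5 | max |
|---|---|---|---|---|
| ArguAna-Fa | 0.44 | nan | 0.45 | 0.62 |
| DigikalamagClustering | 0.79 | nan | 0.40 | 0.79 |
"""
scores = parse_rows(table)
wins = sum(model > e5 for model, e5 in scores.values())
print(wins)  # 1
```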
@mehran-sarmadi many of these scores are way above the baseline e5? Do you have any ideas why this might be? Especially for the
The main reason for this is that the Hakim and Hakim-small models were fine-tuned on the training portions of the SynPerChatbot datasets during Stage 3 (supervised fine-tuning). As a result, their "zero-shot" scores on the leaderboard are much lower than those of models like E5: Hakim and Hakim-small score around 33%, whereas E5 scores approximately 91%. In contrast, the Hakim-unsupervised model did not undergo Stage 3 fine-tuning, and its zero-shot score is also around 90%, so its results are much closer to the E5 baseline, only slightly better, likely due to its exposure to a large volume of high-quality Persian data.
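The zero-shot percentage discussed above is just the share of benchmark tasks absent from a model's training data. A small illustrative sketch (the task names and counts are hypothetical, not the actual leaderboard computation):

```python
def zero_shot_percentage(benchmark_tasks: set[str], training_tasks: set[str]) -> float:
    """Share of benchmark tasks the model never saw during training."""
    unseen = benchmark_tasks - training_tasks
    return 100 * len(unseen) / len(benchmark_tasks)

# Hypothetical: a 4-task benchmark and a model fine-tuned on two SynPerChatbot train splits
benchmark = {"ArguAna-Fa", "FarsTail", "SynPerChatbotConvSAAnger", "SynPerChatbotRAGFAQPC"}
hakim_train = {"SynPerChatbotConvSAAnger", "SynPerChatbotRAGFAQPC"}
pct = zero_shot_percentage(benchmark, hakim_train)
print(pct)  # 50.0
```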
Thanks for the clarification @mehran-sarmadi, we will have to add this to the model card. You can add it like so:
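The snippet referenced in that comment is not shown here. As a rough sketch of the idea only: mteb model metadata records which datasets a model was trained on, and the leaderboard uses that to flag non-zero-shot tasks. The `training_datasets` mapping below is an assumption about the convention (dataset name to trained splits), not the actual snippet from the PR.

```python
# Hypothetical excerpt of model metadata: datasets seen during fine-tuning,
# mapped to the splits used. Names follow the tables above; the field shape
# is an assumption about mteb conventions.
training_datasets = {
    "SynPerChatbotConvSAAnger": ["train"],
    "SynPerChatbotRAGFAQRetrieval": ["train"],
}

def is_zero_shot(task_name: str) -> bool:
    """A task counts as zero-shot for the model if it was not trained on."""
    return task_name not in training_datasets

print(is_zero_shot("ArguAna-Fa"))  # True
print(is_zero_shot("SynPerChatbotConvSAAnger"))  # False
```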
Hi @KennethEnevoldsen |
Ah, sorry, I overlooked that; then everything is good here!
In this PR, we add results for five models: three from the Hakim family (Hakim, Hakim-small, and Hakim-unsup) and two from the Tooka-SBERT-V2 family (Tooka-SBERT-V2-Small and Tooka-SBERT-V2-Large).
Checklist

- Added the model to `mteb/models/` (this can be as an API). Instructions on how to add a model can be found here.