kalm-emb-v2 results #235

Merged
Samoed merged 3 commits into embeddings-benchmark:main from ItsukiFujii:main
Jul 16, 2025

@ItsukiFujii
Contributor

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@github-actions

github-actions bot commented Jul 9, 2025

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, AmazonPolarityClassification, AmazonReviewsClassification, ArguAna, ArxivClusteringP2P, ArxivClusteringS2S, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P, BiorxivClusteringS2S, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackAndroidRetrieval, CQADupstackEnglishRetrieval, CQADupstackGamingRetrieval, CQADupstackGisRetrieval, CQADupstackMathematicaRetrieval, CQADupstackPhysicsRetrieval, CQADupstackProgrammersRetrieval, CQADupstackStatsRetrieval, CQADupstackTexRetrieval, CQADupstackUnixRetrieval, CQADupstackWebmastersRetrieval, CQADupstackWordpressRetrieval, ClimateFEVER, CmedqaRetrieval, Cmnli, CovidRetrieval, DBPedia, DuRetrieval, EcomRetrieval, EmotionClassification, FEVER, FiQA2018, HotpotQA, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MSMARCO, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P, MedrxivClusteringS2S, MindSmallReranking, MultilingualSentiment, NFCorpus, NQ, Ocnli, OnlineShopping, PAWSX, QBQTC, QuoraRetrieval, RedditClustering, RedditClusteringP2P, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSB, STSBenchmark, SciDocsRR, SciFact, SprintDuplicateQuestions, StackExchangeClustering, StackExchangeClusteringP2P, StackOverflowDupQuestions, SummEval, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2

| task_name | HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2 | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.44 | nan | 0.33 | 0.72 |
| ATEC | 0.5 | nan | 0.4 | 0.65 |
| AmazonCounterfactualClassification | 0.96 | 0.88 | 0.7 | 0.97 |
| AmazonPolarityClassification | 0.97 | nan | 0.93 | 0.98 |
| AmazonReviewsClassification | 0.58 | nan | 0.43 | 0.65 |
| ArguAna | 0.57 | 0.86 | 0.54 | 0.90 |
| ArxivClusteringP2P | 0.51 | nan | 0.44 | 0.61 |
| ArxivClusteringS2S | 0.44 | nan | 0.38 | 0.55 |
| AskUbuntuDupQuestions | 0.62 | 0.64 | 0.59 | 0.70 |
| BIOSSES | 0.84 | 0.89 | 0.85 | 0.97 |
| BQ | 0.63 | nan | 0.48 | 0.81 |
| Banking77Classification | 0.89 | 0.94 | 0.75 | 0.94 |
| BiorxivClusteringP2P | 0.48 | nan | 0.36 | 0.55 |
| BiorxivClusteringS2S | 0.42 | nan | 0.33 | 0.51 |
| CLSClusteringP2P | 0.63 | nan | nan | 0.82 |
| CLSClusteringS2S | 0.59 | nan | nan | 0.74 |
| CMedQAv1-reranking | 0.84 | nan | 0.68 | 0.92 |
| CMedQAv2-reranking | 0.84 | nan | 0.67 | 0.92 |
| CQADupstackAndroidRetrieval | 0.54 | nan | 0.49 | 0.74 |
| CQADupstackEnglishRetrieval | 0.49 | nan | 0.46 | 0.70 |
| CQADupstackGamingRetrieval | 0.62 | 0.71 | 0.59 | 0.79 |
| CQADupstackGisRetrieval | 0.43 | nan | 0.37 | 0.63 |
| CQADupstackMathematicaRetrieval | 0.33 | nan | 0.28 | 0.69 |
| CQADupstackPhysicsRetrieval | 0.48 | nan | 0.44 | 0.74 |
| CQADupstackProgrammersRetrieval | 0.45 | nan | 0.42 | 0.66 |
| CQADupstackStatsRetrieval | 0.37 | nan | 0.32 | 0.62 |
| CQADupstackTexRetrieval | 0.33 | nan | 0.28 | 0.63 |
| CQADupstackUnixRetrieval | 0.46 | 0.54 | 0.4 | 0.72 |
| CQADupstackWebmastersRetrieval | 0.43 | nan | 0.4 | 0.68 |
| CQADupstackWordpressRetrieval | 0.36 | nan | 0.32 | 0.59 |
| ClimateFEVER | 0.25 | nan | 0.26 | 0.57 |
| CmedqaRetrieval | 0.45 | nan | 0.29 | 0.57 |
| Cmnli | 0.78 | nan | nan | 0.93 |
| CovidRetrieval | 0.83 | 0.79 | 0.76 | 0.96 |
| DBPedia | 0.4 | nan | 0.41 | 0.53 |
| DuRetrieval | 0.83 | nan | 0.85 | 0.94 |
| EcomRetrieval | 0.65 | nan | 0.55 | 0.78 |
| EmotionClassification | 0.92 | nan | 0.48 | 0.94 |
| FEVER | 0.83 | nan | 0.83 | 0.96 |
| FiQA2018 | 0.45 | 0.62 | 0.44 | 0.80 |
| HotpotQA | 0.7 | nan | 0.71 | 0.88 |
| IFlyTek | 0.51 | nan | 0.42 | 0.58 |
| ImdbClassification | 0.95 | 0.95 | 0.89 | 0.97 |
| JDReview | 0.87 | nan | 0.81 | 0.92 |
| LCQMC | 0.74 | nan | 0.76 | 0.81 |
| MMarcoReranking | 0.26 | nan | 0.29 | 0.47 |
| MMarcoRetrieval | 0.81 | nan | 0.79 | 0.90 |
| MSMARCO | 0.36 | nan | 0.44 | 0.48 |
| MTOPDomainClassification | 0.99 | 0.98 | 0.9 | 1.00 |
| MTOPIntentClassification | 0.89 | nan | 0.67 | 0.95 |
| MassiveIntentClassification | 0.78 | 0.82 | 0.6 | 0.92 |
| MassiveScenarioClassification | 0.86 | 0.87 | 0.7 | 0.99 |
| MedicalRetrieval | 0.6 | nan | 0.51 | 0.76 |
| MedrxivClusteringP2P | 0.44 | nan | 0.32 | 0.52 |
| MedrxivClusteringS2S | 0.41 | nan | 0.3 | 0.50 |
| MindSmallReranking | 0.32 | 0.33 | 0.3 | 0.34 |
| MultilingualSentiment | 0.78 | nan | 0.71 | 0.83 |
| NFCorpus | 0.35 | nan | 0.34 | 0.56 |
| NQ | 0.48 | nan | 0.64 | 0.82 |
| Ocnli | 0.78 | nan | nan | 0.92 |
| OnlineShopping | 0.94 | nan | 0.9 | 0.97 |
| PAWSX | 0.43 | nan | 0.15 | 0.66 |
| QBQTC | 0.38 | nan | nan | 0.71 |
| QuoraRetrieval | 0.9 | nan | 0.89 | 0.92 |
| RedditClustering | 0.77 | nan | 0.47 | 0.77 |
| RedditClusteringP2P | 0.73 | nan | 0.63 | 0.75 |
| SCIDOCS | 0.21 | 0.25 | 0.17 | 0.35 |
| SICK-R | 0.8 | 0.83 | 0.8 | 0.95 |
| STS12 | 0.82 | 0.82 | 0.8 | 0.95 |
| STS13 | 0.86 | 0.90 | 0.82 | 0.98 |
| STS14 | 0.84 | 0.85 | 0.78 | 0.98 |
| STS15 | 0.86 | 0.90 | 0.89 | 0.98 |
| STS16 | 0.86 | nan | 0.86 | 0.98 |
| STS17 | 0.67 | 0.89 | 0.82 | 0.93 |
| STS22 | 0.64 | nan | 0.59 | 0.84 |
| STSB | 0.81 | 0.85 | 0.82 | 0.92 |
| STSBenchmark | 0.85 | 0.89 | 0.87 | 0.94 |
| SciDocsRR | 0.82 | nan | 0.84 | 0.91 |
| SciFact | 0.72 | nan | 0.7 | 0.87 |
| SprintDuplicateQuestions | 0.96 | 0.97 | 0.93 | 0.98 |
| StackExchangeClustering | 0.78 | nan | 0.58 | 0.84 |
| StackExchangeClusteringP2P | 0.45 | nan | 0.33 | 0.52 |
| StackOverflowDupQuestions | 0.51 | nan | 0.5 | 0.63 |
| SummEval | 0.29 | nan | 0.3 | 0.41 |
| T2Reranking | 0.67 | 0.68 | 0.66 | 0.73 |
| T2Retrieval | 0.85 | nan | 0.76 | 0.89 |
| TNews | 0.51 | nan | 0.49 | 0.59 |
| TRECCOVID | 0.79 | 0.86 | 0.71 | 0.95 |
| ThuNewsClusteringP2P | 0.81 | nan | nan | 0.89 |
| ThuNewsClusteringS2S | 0.76 | nan | nan | 0.88 |
| Touche2020 | 0.28 | nan | 0.23 | 0.39 |
| ToxicConversationsClassification | 0.89 | 0.89 | 0.66 | 0.98 |
| TweetSentimentExtractionClassification | 0.79 | 0.70 | 0.63 | 0.88 |
| TwentyNewsgroupsClustering | 0.74 | nan | 0.39 | 0.83 |
| TwitterSemEval2015 | 0.77 | 0.79 | 0.75 | 0.89 |
| TwitterURLCorpus | 0.86 | 0.87 | 0.86 | 0.96 |
| VideoRetrieval | 0.76 | nan | 0.58 | 0.84 |
| Waimai | 0.89 | nan | 0.86 | 0.92 |
| Average | 0.65 | 0.79 | 0.58 | 0.78 |
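
A note on reading the Average row: `nan` marks a task with no score for that model, so the columns are not necessarily averaged over the same set of tasks. The sketch below is only an illustration, assuming a mean that skips missing scores; the bot's actual aggregation is not shown in this thread. The four rows are copied from the results above, and the helper names `nan_mean` and `column_means` are hypothetical.

```python
import math

# Four rows copied from the results above:
# (task, KaLM mini-instruct-v2, gemini-embedding-001, multilingual-e5-large)
ROWS = [
    ("AmazonCounterfactualClassification", 0.96, 0.88, 0.70),
    ("ArguAna", 0.57, 0.86, 0.54),
    ("AFQMC", 0.44, math.nan, 0.33),
    ("CLSClusteringP2P", 0.63, math.nan, math.nan),
]


def nan_mean(values):
    """Mean over the scores that are present; nan if none are."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


def column_means(rows):
    """Per-model mean, each computed over that model's available tasks only."""
    columns = zip(*(row[1:] for row in rows))
    return [nan_mean(column) for column in columns]


if __name__ == "__main__":
    for name, mean in zip(("KaLM", "gemini", "e5-large"), column_means(ROWS)):
        print(f"{name}: {mean:.2f}")
```

Under that assumption, a model scored on only a subset of tasks gets an average over that subset alone, which makes its average hard to compare directly with a model scored on every task.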

@Samoed
Member

Samoed commented Jul 9, 2025

The folder name should be the revision of the model instead of external
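
For illustration, a minimal sketch of the folder layout this comment asks for, assuming results are stored as `<root>/<model folder>/<revision>/<task>.json` and that the `/` in the model name is replaced with `__` in the folder name. Both are assumptions about the repository layout, not confirmed in this thread; `result_path` is a hypothetical helper and `abc123` is a made-up revision.

```python
from pathlib import Path


def result_path(root, model_name, revision, task_name):
    """Build the path of one task's result file under a revision folder.

    Assumptions (not confirmed in this thread): "/" in the model name becomes
    "__" in the folder name, and each task has its own JSON file.
    """
    model_dir = model_name.replace("/", "__")
    return Path(root) / model_dir / revision / f"{task_name}.json"


if __name__ == "__main__":
    # "abc123" is a placeholder, not the model's real revision.
    print(result_path(
        "results",
        "HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2",
        "abc123",
        "AFQMC",
    ))
```

The point of the layout is that the second-level folder carries the model revision rather than a generic name such as `external`.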

@ItsukiFujii
Contributor Author

The folder has been renamed to the revision.

@ItsukiFujii
Contributor Author

@Samoed
I've already submitted a PR for the model implementation to mteb. It would be nice if you could review this PR :)

@ItsukiFujii
Contributor Author

Hi @Samoed
All checks have passed. It would be nice if you could please review this PR :)

@Samoed
Member

Samoed commented Jul 15, 2025

Looks good. Will merge after implementation

@ItsukiFujii
Contributor Author

> Looks good. Will merge after implementation

Hi @Samoed
The model implementation has been merged. Please review this PR again.

Samoed merged commit 1b4c767 into embeddings-benchmark:main on Jul 16, 2025
3 checks passed