kalm-emb-v2.5 results #303

Merged
KennethEnevoldsen merged 2 commits into embeddings-benchmark:main from KaLM-Embedding:main
Oct 20, 2025

Conversation

@ItsukiFujii
Contributor

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, AmazonPolarityClassification, AmazonReviewsClassification, ArguAna, ArxivClusteringP2P, ArxivClusteringS2S, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P, BiorxivClusteringS2S, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackAndroidRetrieval, CQADupstackEnglishRetrieval, CQADupstackGamingRetrieval, CQADupstackGisRetrieval, CQADupstackMathematicaRetrieval, CQADupstackPhysicsRetrieval, CQADupstackProgrammersRetrieval, CQADupstackStatsRetrieval, CQADupstackTexRetrieval, CQADupstackUnixRetrieval, CQADupstackWebmastersRetrieval, CQADupstackWordpressRetrieval, ClimateFEVER, CmedqaRetrieval, Cmnli, CovidRetrieval, DBPedia, DuRetrieval, EcomRetrieval, EmotionClassification, FEVER, FiQA2018, HotpotQA, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MSMARCO, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P, MedrxivClusteringS2S, MindSmallReranking, MultilingualSentiment, NFCorpus, NQ, Ocnli, OnlineShopping, PAWSX, QBQTC, QuoraRetrieval, RedditClustering, RedditClusteringP2P, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS16, STS17, STS22, STSB, STSBenchmark, SciDocsRR, SciFact, SprintDuplicateQuestions, StackExchangeClustering, StackExchangeClusteringP2P, StackOverflowDupQuestions, SummEval, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5

task_name KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5 google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
AFQMC 0.4878 nan 0.3301 0.7225
ATEC 0.5273 nan 0.3981 0.6464
AmazonCounterfactualClassification 0.9548 0.9289 0.7974 0.9696
AmazonPolarityClassification 0.9703 nan 0.9349 0.9774
AmazonReviewsClassification 0.6415 nan 0.492 0.6880
ArguAna 0.6015 0.8644 0.5438 0.8979
ArxivClusteringP2P 0.5211 nan 0.4431 0.6092
ArxivClusteringS2S 0.451 nan 0.3843 0.5520
AskUbuntuDupQuestions 0.6239 0.6424 0.6028 0.7020
BIOSSES 0.8402 0.8897 0.8457 0.9692
BQ 0.7114 nan 0.485 0.8125
Banking77Classification 0.9031 0.9427 0.8473 0.9427
BiorxivClusteringP2P 0.4851 nan 0.355 0.5522
BiorxivClusteringS2S 0.4275 nan 0.335 0.5093
CLSClusteringP2P 0.6625 nan nan 0.8225
CLSClusteringS2S 0.6273 nan nan 0.7627
CMedQAv1-reranking 0.8458 nan 0.6765 0.9434
CMedQAv2-reranking 0.8578 nan 0.6678 0.9353
CQADupstackAndroidRetrieval 0.5714 nan 0.4904 0.7426
CQADupstackEnglishRetrieval 0.5213 nan 0.4581 0.6998
CQADupstackGamingRetrieval 0.6552 0.7068 0.587 0.7861
CQADupstackGisRetrieval 0.453 nan 0.3695 0.6340
CQADupstackMathematicaRetrieval 0.3606 nan 0.2818 0.6948
CQADupstackPhysicsRetrieval 0.5168 nan 0.4366 0.7371
CQADupstackProgrammersRetrieval 0.4925 nan 0.416 0.6587
CQADupstackStatsRetrieval 0.405 nan 0.3238 0.6242
CQADupstackTexRetrieval 0.3523 nan 0.2836 0.6295
CQADupstackUnixRetrieval 0.4887 0.5369 0.3988 0.7198
CQADupstackWebmastersRetrieval 0.4711 nan 0.3988 0.6835
CQADupstackWordpressRetrieval 0.3757 nan 0.3164 0.5862
ClimateFEVER 0.345 nan 0.2573 0.5693
CmedqaRetrieval 0.4587 nan 0.2866 0.5658
Cmnli 0.861 nan nan 0.9579
CovidRetrieval 0.8357 0.7913 0.7561 0.9606
DBPedia 0.4262 nan 0.413 0.5350
DuRetrieval 0.8614 nan 0.853 0.9423
EcomRetrieval 0.6668 nan 0.5467 0.7881
EmotionClassification 0.838 nan 0.4758 0.9387
FEVER 0.8789 nan 0.8281 0.9628
FiQA2018 0.471 0.6178 0.4381 0.8206
HotpotQA 0.7176 nan 0.7123 0.8758
IFlyTek 0.5659 nan 0.4186 0.5973
ImdbClassification 0.9591 0.9498 0.9023 0.9737
JDReview 0.8882 nan 0.8054 0.9214
LCQMC 0.775 nan 0.7595 0.8354
MMarcoReranking 0.2964 nan 0.2912 0.4689
MMarcoRetrieval 0.8223 nan 0.792 0.9033
MSMARCO 0.4062 nan 0.437 0.4812
MTOPDomainClassification 0.9869 0.9927 0.9367 0.9995
MTOPIntentClassification 0.911 nan 0.779 0.9551
MassiveIntentClassification 0.8324 0.8846 0.7376 0.9194
MassiveScenarioClassification 0.8935 0.9208 0.7751 0.9930
MedicalRetrieval 0.6046 nan 0.5144 0.7562
MedrxivClusteringP2P 0.4309 nan 0.317 0.5153
MedrxivClusteringS2S 0.4043 nan 0.2976 0.4969
MindSmallReranking 0.3245 0.3295 0.3142 0.3437
MultilingualSentiment 0.8057 nan 0.709 0.8536
NFCorpus 0.3711 nan 0.3399 0.5575
NQ 0.5861 nan 0.6406 0.8248
Ocnli 0.8212 nan nan 0.9518
OnlineShopping 0.9502 nan 0.9045 0.9716
PAWSX 0.479 nan 0.1463 0.7331
QBQTC 0.3983 nan nan 0.7145
QuoraRetrieval 0.8957 nan 0.8926 0.9235
RedditClustering 0.7689 nan 0.4691 0.7716
RedditClusteringP2P 0.7284 nan 0.6322 0.7527
SCIDOCS 0.2162 0.2515 0.1747 0.3453
SICK-R 0.832 0.8275 0.8023 0.9465
STS12 0.819 0.8155 0.8002 0.9546
STS13 0.8952 0.8989 0.8155 0.9776
STS14 0.8599 0.8541 0.7772 0.9753
STS15 0.9033 0.9044 0.8931 0.9811
STS16 0.8774 nan 0.8579 0.9763
STS17 0.8135 0.8887 0.8209 0.9323
STS22 0.7136 nan 0.6485 0.7743
STSB 0.829 0.8550 0.8236 0.9199
STSBenchmark 0.8888 0.8908 0.8729 0.9504
SciDocsRR 0.8468 nan 0.8422 0.9114
SciFact 0.7438 nan 0.7042 0.8660
SprintDuplicateQuestions 0.9609 0.9690 0.9318 0.9838
StackExchangeClustering 0.8022 nan 0.5837 0.8395
StackExchangeClusteringP2P 0.4726 nan 0.329 0.5157
StackOverflowDupQuestions 0.5182 nan 0.5014 0.6292
SummEval 0.3118 nan 0.2969 0.4052
T2Reranking 0.676 0.6795 0.6632 0.7315
T2Retrieval 0.8597 nan 0.7607 0.8926
TNews 0.5327 nan 0.488 0.6090
TRECCOVID 0.8298 0.8631 0.7133 0.9499
ThuNewsClusteringP2P 0.8464 nan nan 0.8976
ThuNewsClusteringS2S 0.7875 nan nan 0.8955
Touche2020 0.2893 nan 0.2339 0.3939
ToxicConversationsClassification 0.917 0.8875 0.7132 0.9759
TweetSentimentExtractionClassification 0.8008 0.6988 0.628 0.8823
TwentyNewsgroupsClustering 0.7326 nan 0.394 0.8349
TwitterSemEval2015 0.7715 0.7917 0.7548 0.8946
TwitterURLCorpus 0.8666 0.8705 0.8589 0.9571
VideoRetrieval 0.7644 nan 0.5828 0.8384
Waimai 0.8991 nan 0.863 0.9231
Average 0.6729 0.7982 0.5869 0.7847
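
Note on the Average row: each column appears to be averaged over only the tasks where that model has a score, so the google/gemini-embedding-001 average (30 scored tasks) is not directly comparable to the other columns. The sketch below illustrates this with the gemini scores copied from the table, plus two of its nan rows. The names `scores`, `nan_mean` and `shared_means` are made up for this illustration and are not part of mteb or the results bot.

```python
import math

NAN = math.nan

# task -> (KaLM mini-instruct-v2.5, gemini-embedding-001), copied from the table.
# Only tasks with a gemini score are listed, plus two nan rows for illustration.
scores = {
    "AFQMC": (0.4878, NAN),
    "ATEC": (0.5273, NAN),
    "AmazonCounterfactualClassification": (0.9548, 0.9289),
    "ArguAna": (0.6015, 0.8644),
    "AskUbuntuDupQuestions": (0.6239, 0.6424),
    "BIOSSES": (0.8402, 0.8897),
    "Banking77Classification": (0.9031, 0.9427),
    "CQADupstackGamingRetrieval": (0.6552, 0.7068),
    "CQADupstackUnixRetrieval": (0.4887, 0.5369),
    "CovidRetrieval": (0.8357, 0.7913),
    "FiQA2018": (0.471, 0.6178),
    "ImdbClassification": (0.9591, 0.9498),
    "MTOPDomainClassification": (0.9869, 0.9927),
    "MassiveIntentClassification": (0.8324, 0.8846),
    "MassiveScenarioClassification": (0.8935, 0.9208),
    "MindSmallReranking": (0.3245, 0.3295),
    "SCIDOCS": (0.2162, 0.2515),
    "SICK-R": (0.832, 0.8275),
    "STS12": (0.819, 0.8155),
    "STS13": (0.8952, 0.8989),
    "STS14": (0.8599, 0.8541),
    "STS15": (0.9033, 0.9044),
    "STS17": (0.8135, 0.8887),
    "STSB": (0.829, 0.8550),
    "STSBenchmark": (0.8888, 0.8908),
    "SprintDuplicateQuestions": (0.9609, 0.9690),
    "T2Reranking": (0.676, 0.6795),
    "TRECCOVID": (0.8298, 0.8631),
    "ToxicConversationsClassification": (0.917, 0.8875),
    "TweetSentimentExtractionClassification": (0.8008, 0.6988),
    "TwitterSemEval2015": (0.7715, 0.7917),
    "TwitterURLCorpus": (0.8666, 0.8705),
}


def nan_mean(values):
    """Mean of the non-nan entries; nan if there are none."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


def shared_means(table):
    """Per-model means restricted to tasks where every model has a score."""
    shared = [pair for pair in table.values()
              if not any(math.isnan(v) for v in pair)]
    return tuple(nan_mean(col) for col in zip(*shared)), len(shared)


# Skipping nan entries reproduces the gemini figure in the Average row.
gemini_avg = nan_mean(g for _, g in scores.values())
print(round(gemini_avg, 4))  # 0.7982

# A like-for-like comparison uses only the tasks both models were scored on.
(kalm_shared, gemini_shared), n_shared = shared_means(scores)
print(n_shared, round(kalm_shared, 4), round(gemini_shared, 4))
```

Restricting both columns to the shared tasks, as `shared_means` does, gives a fairer side-by-side number than reading the Average row across columns.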

@ItsukiFujii
Contributor Author

Hi @Samoed
All checks have passed. It would be nice if you could please review this PR :)

@Samoed
Member

Samoed commented Oct 20, 2025

It looks good to me. I requested a review from Kenneth.

@KennethEnevoldsen
Contributor

Looks good to me too!

KennethEnevoldsen merged commit f445873 into embeddings-benchmark:main Oct 20, 2025
3 checks passed