
Add Codefuse models #277

Merged

KennethEnevoldsen merged 5 commits into embeddings-benchmark:main from Geralt-Targaryen:main on Sep 27, 2025

Conversation


@Geralt-Targaryen (Contributor) commented Sep 22, 2025

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here (a minimal loading sketch follows this checklist)
    • [ ] No, but there is an existing PR here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on, e.g., Hugging Face
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.
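For context, below is a minimal, hedged sketch of loading one of these reference implementations through mteb's model registry. It assumes a recent mteb release that already ships the F2LLM models; the example sentences and the task name passed to `encode` are illustrative only.

```python
# Minimal sketch, assuming a recent mteb release that includes the
# F2LLM reference implementations under mteb/models/.
import mteb

# Load the reference implementation by its Hugging Face model name.
model = mteb.get_model("codefuse-ai/F2LLM-0.6B")

# mteb encoders accept the task name alongside the sentences, so
# instruction-tuned models can pick an appropriate prompt.
embeddings = model.encode(
    ["How do I reset my password?", "Steps for resetting a password"],
    task_name="Banking77Classification",
)
print(embeddings.shape)  # expected: (2, embedding_dim)
```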

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-0.6B, codefuse-ai/F2LLM-1.7B, codefuse-ai/F2LLM-4B
Tasks: AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, Banking77Classification, BiorxivClusteringP2P.v2, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, ImdbClassification, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, TRECCOVID, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus
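A comparison like the one below could in principle be reproduced locally with the mteb package. The following is a hedged sketch rather than the exact CI configuration; the task subset and output folder are illustrative.

```python
# Sketch of re-running a slice of this evaluation locally; the tasks
# listed here are a small sample of the full set above.
import mteb

model = mteb.get_model("codefuse-ai/F2LLM-0.6B")
tasks = mteb.get_tasks(tasks=["Banking77Classification", "STSBenchmark", "SCIDOCS"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")

for task_result in results:
    # get_score() returns the task's main score, i.e. the numbers tabulated below.
    print(task_result.task_name, task_result.get_score())
```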

Results for codefuse-ai/F2LLM-0.6B

| task_name | codefuse-ai/F2LLM-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9472 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6632 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6400 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.5861 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6455 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8363 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.8901 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.6494 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6035 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5239 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4384 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.8878 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.4769 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.6951 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9564 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9918 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8497 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9063 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.5617 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5372 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3122 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2263 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8014 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.7959 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8649 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8322 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8778 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9020 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6446 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8679 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9466 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7366 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.4936 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2454 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.5867 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5454 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9189 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7894 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.5468 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6622 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8359 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7005 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


Results for codefuse-ai/F2LLM-1.7B

| task_name | codefuse-ai/F2LLM-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9395 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6661 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6429 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6097 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6736 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8780 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9045 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.7375 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6368 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5636 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4063 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.8930 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5369 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7164 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9633 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9924 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8612 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9148 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.6131 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5934 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3232 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2472 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8143 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8070 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8795 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8409 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8858 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9032 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6683 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8736 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9407 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7650 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5041 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2988 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6204 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5522 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9036 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7983 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6079 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6985 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8589 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7204 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


Results for codefuse-ai/F2LLM-4B

| task_name | codefuse-ai/F2LLM-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9350 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6560 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6450 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6193 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6707 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8756 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9186 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.8417 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6537 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5901 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4339 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.9187 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5839 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7311 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9688 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9930 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8784 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9225 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.7199 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.7023 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3303 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2670 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8170 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8164 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8930 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8547 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8909 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.8931 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6654 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8723 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9117 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7882 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5014 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.3319 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6064 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5589 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9202 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.8030 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6288 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7430 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8580 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7368 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


@KennethEnevoldsen added the `waiting for review of implementation` label (This PR is waiting for an implementation review before merging the results.) on Sep 22, 2025
@Geralt-Targaryen (Contributor, Author)

Hi, the implementation has been merged into the mteb repo here. Could you please review the results? Thanks! @KennethEnevoldsen @Samoed

@Samoed removed the `waiting for review of implementation` label on Sep 25, 2025
@KennethEnevoldsen (Contributor)

Sorry for the late review @Geralt-Targaryen - looks good here

@KennethEnevoldsen merged commit 8fd0714 into embeddings-benchmark:main on Sep 27, 2025
4 checks passed
