
Add Codefuse models #277

Merged

KennethEnevoldsen merged 5 commits into embeddings-benchmark:main from Geralt-Targaryen:main on Sep 27, 2025

Conversation


@Geralt-Targaryen (Contributor) commented Sep 22, 2025

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here (a minimal loading sketch follows this checklist)
    • [ ] No, but there is an existing PR here
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on, e.g., Hugging Face
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.
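For context, below is a minimal, hedged sketch of loading one of these reference implementations through mteb's model registry. It assumes a recent mteb release that already ships the F2LLM models; the example sentences and the task name passed to `encode` are illustrative only.

```python
# Minimal sketch, assuming a recent mteb release that includes the
# F2LLM reference implementations under mteb/models/.
import mteb

# Load the reference implementation by its Hugging Face model name.
model = mteb.get_model("codefuse-ai/F2LLM-0.6B")

# mteb encoders accept the task name alongside the sentences, so
# instruction-tuned models can pick an appropriate prompt.
embeddings = model.encode(
    ["How do I reset my password?", "Steps for resetting a password"],
    task_name="Banking77Classification",
)
print(embeddings.shape)  # expected: (2, embedding_dim)
```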

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-0.6B, codefuse-ai/F2LLM-1.7B, codefuse-ai/F2LLM-4B
Tasks: AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, Banking77Classification, BiorxivClusteringP2P.v2, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, ImdbClassification, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, TRECCOVID, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus
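A comparison like the one below could in principle be reproduced locally with the mteb package. The following is a hedged sketch rather than the exact CI configuration; the task subset and output folder are illustrative.

```python
# Sketch of re-running a slice of this evaluation locally; the tasks
# listed here are a small sample of the full set above.
import mteb

model = mteb.get_model("codefuse-ai/F2LLM-0.6B")
tasks = mteb.get_tasks(tasks=["Banking77Classification", "STSBenchmark", "SCIDOCS"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")

for task_result in results:
    # get_score() returns the task's main score, i.e. the numbers tabulated below.
    print(task_result.task_name, task_result.get_score())
```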

Results for codefuse-ai/F2LLM-0.6B

| task_name | codefuse-ai/F2LLM-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9472 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6632 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6400 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.5861 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6455 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8363 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.8901 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.6494 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6035 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5239 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4384 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.8878 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.4769 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.6951 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9564 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9918 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8497 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9063 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.5617 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5372 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3122 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2263 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8014 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.7959 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8649 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8322 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8778 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9020 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6446 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8679 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9466 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7366 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.4936 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2454 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.5867 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5454 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9189 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7894 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.5468 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6622 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8359 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7005 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


Results for codefuse-ai/F2LLM-1.7B

| task_name | codefuse-ai/F2LLM-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9395 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6661 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6429 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6097 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6736 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8780 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9045 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.7375 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6368 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5636 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4063 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.8930 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5369 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7164 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9633 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9924 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8612 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9148 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.6131 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5934 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3232 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2472 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8143 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8070 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8795 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8409 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8858 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9032 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6683 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8736 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9407 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7650 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5041 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2988 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6204 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5522 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9036 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7983 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6079 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6985 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8589 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7204 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


Results for codefuse-ai/F2LLM-4B

| task_name | codefuse-ai/F2LLM-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9350 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6560 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6450 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6193 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6707 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8756 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9186 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.8417 | 0.5386 | 0.3720 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6537 | 0.7068 | 0.5870 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5901 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4339 | 0.3106 | 0.2600 | 0.4900 |
| FEVERHardNegatives | 0.9187 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5839 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7311 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9688 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9930 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8784 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9225 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.7199 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.7023 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3303 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2670 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8170 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8164 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8930 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8547 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8909 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.8931 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6654 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8723 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9117 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7882 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5014 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.3319 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6064 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5589 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9202 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.8030 | 0.6988 | 0.6280 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6288 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7430 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8580 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7368 | 0.7330 | 0.6207 | 0.8115 |

The model shows high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2


@KennethEnevoldsen added the `waiting for review of implementation` label (This PR is waiting for an implementation review before merging the results.) on Sep 22, 2025
@Geralt-Targaryen (Contributor, Author)

Hi, the implementation has been merged into the mteb repo here. Could you please review the results? Thanks! @KennethEnevoldsen @Samoed

@Samoed removed the `waiting for review of implementation` label on Sep 25, 2025
@KennethEnevoldsen (Contributor)

Sorry for the late review @Geralt-Targaryen - looks good here

@KennethEnevoldsen merged commit 8fd0714 into embeddings-benchmark:main on Sep 27, 2025
4 checks passed
