Add Codefuse models #277
Conversation
Model Results Comparison
Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large
Results for codefuse-ai/F2LLM-0.6B
| task_name | codefuse-ai/F2LLM-0.6B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9472 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6632 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6400 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.5861 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6455 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8363 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.8901 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.6494 | 0.5386 | 0.372 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6035 | 0.7068 | 0.587 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5239 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4384 | 0.3106 | 0.26 | 0.4900 |
| FEVERHardNegatives | 0.8878 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.4769 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.6951 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9564 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9918 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8497 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9063 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.5617 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5372 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3122 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2263 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8014 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.7959 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8649 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8322 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8778 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9020 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6446 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8679 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9466 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7366 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.4936 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2454 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.5867 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5454 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9189 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7894 | 0.6988 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.5468 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6622 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8359 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7005 | 0.7330 | 0.6207 | 0.8115 |
The model has high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2
Results for codefuse-ai/F2LLM-1.7B
| task_name | codefuse-ai/F2LLM-1.7B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9395 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6661 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6429 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6097 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6736 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8780 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9045 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.7375 | 0.5386 | 0.372 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6368 | 0.7068 | 0.587 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5636 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4063 | 0.3106 | 0.26 | 0.4900 |
| FEVERHardNegatives | 0.8930 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5369 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7164 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9633 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9924 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8612 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9148 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.6131 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.5934 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3232 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2472 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8143 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8070 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8795 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8409 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8858 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.9032 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6683 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8736 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9407 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7650 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5041 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.2988 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6204 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5522 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9036 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.7983 | 0.6988 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6079 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.6985 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8589 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7204 | 0.7330 | 0.6207 | 0.8115 |
The model has high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2
Results for codefuse-ai/F2LLM-4B
| task_name | codefuse-ai/F2LLM-4B | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.9350 | 0.9289 | nan | 0.9696 |
| ArXivHierarchicalClusteringP2P | 0.6560 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.6450 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.6193 | 0.8644 | 0.5436 | 0.8979 |
| AskUbuntuDupQuestions | 0.6707 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8756 | 0.8897 | 0.8457 | 0.9692 |
| Banking77Classification | 0.9186 | 0.9427 | 0.7492 | 0.9427 |
| BiorxivClusteringP2P.v2 | 0.8417 | 0.5386 | 0.372 | 0.5642 |
| CQADupstackGamingRetrieval | 0.6537 | 0.7068 | 0.587 | 0.7861 |
| CQADupstackUnixRetrieval | 0.5901 | 0.5369 | 0.3988 | 0.7198 |
| ClimateFEVERHardNegatives | 0.4339 | 0.3106 | 0.26 | 0.4900 |
| FEVERHardNegatives | 0.9187 | 0.8898 | 0.8379 | 0.9453 |
| FiQA2018 | 0.5839 | 0.6178 | 0.4381 | 0.7991 |
| HotpotQAHardNegatives | 0.7311 | 0.8701 | 0.7055 | 0.8701 |
| ImdbClassification | 0.9688 | 0.9498 | 0.8867 | 0.9737 |
| MTOPDomainClassification | 0.9930 | 0.9927 | 0.9097 | 0.9995 |
| MassiveIntentClassification | 0.8784 | 0.8846 | 0.6804 | 0.9194 |
| MassiveScenarioClassification | 0.9225 | 0.9208 | 0.7178 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.7199 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.7023 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3303 | 0.3295 | 0.3024 | 0.3437 |
| SCIDOCS | 0.2670 | 0.2515 | 0.1745 | 0.3453 |
| SICK-R | 0.8170 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.8164 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8930 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8547 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8909 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.8931 | 0.9161 | 0.8812 | 0.9586 |
| STS22.v2 | 0.6654 | 0.6797 | 0.6366 | 0.7984 |
| STSBenchmark | 0.8723 | 0.8908 | 0.8729 | 0.9504 |
| SprintDuplicateQuestions | 0.9117 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.7882 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.5014 | 0.5091 | 0.3854 | 0.5510 |
| SummEvalSummarization.v2 | 0.3319 | 0.3828 | 0.3141 | 0.3893 |
| TRECCOVID | 0.6064 | 0.8631 | 0.7115 | 0.9499 |
| Touche2020Retrieval.v3 | 0.5589 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.9202 | 0.8875 | 0.6601 | 0.9759 |
| TweetSentimentExtractionClassification | 0.8030 | 0.6988 | 0.628 | 0.8823 |
| TwentyNewsgroupsClustering.v2 | 0.6288 | 0.5737 | 0.3921 | 0.8758 |
| TwitterSemEval2015 | 0.7430 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8580 | 0.8705 | 0.8583 | 0.9571 |
| Average | 0.7368 | 0.7330 | 0.6207 | 0.8115 |
The model has high performance on these tasks: BiorxivClusteringP2P.v2, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2
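For anyone reproducing the tables, the Average row appears to be the unweighted mean of the per-task scores (with missing entries such as the nan for multilingual-e5-large presumably skipped). A minimal sketch over a three-task subset of the F2LLM-4B column:

```python
# Reproduce the Average row as an unweighted mean of per-task scores.
# Only three of the 41 tasks are shown for brevity, so this prints the
# mean over the subset, not the full-table average.
scores = {
    "AmazonCounterfactualClassification": 0.9350,
    "BiorxivClusteringP2P.v2": 0.8417,
    "MedrxivClusteringP2P.v2": 0.7199,
}
average = sum(scores.values()) / len(scores)
print(f"{average:.4f}")
```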
Hi, the implementation has been merged into the mteb repo here. Could you please review the results? Thanks! @KennethEnevoldsen @Samoed
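For reviewers who want to spot-check a number locally, a minimal sketch using the mteb package, assuming the merged implementation registers the model under its Hugging Face name (the task choice is arbitrary; any task from the tables above works):

```python
import mteb

# Load the merged implementation by its Hugging Face identifier.
model = mteb.get_model("codefuse-ai/F2LLM-0.6B")

# Spot-check a single task from the comparison tables.
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)

# Scores are also written to the output folder as JSON.
results = evaluation.run(model, output_folder="results")
print(results)
```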
Sorry for the late review @Geralt-Targaryen - looks good here
Checklist
- Added the model implementation under mteb/models/ (this can be as an API). Instructions on how to add a model can be found here.
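As an illustration of what that checklist item covers: once a ModelMeta entry exists under mteb/models/, the model can be looked up by name without loading any weights. A minimal sketch of such a registration check:

```python
import mteb

# Look up the registered metadata; this raises if no ModelMeta entry
# exists for the name, so it doubles as a registration check.
meta = mteb.get_model_meta("codefuse-ai/F2LLM-0.6B")
print(meta.name, meta.revision)
```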