
Conversation

@manveertamber
Contributor

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g. Huggingface
  • I solemnly swear that for all results submitted I have not trained on the dataset, including the training set. If I have, I have disclosed it clearly.

@KennethEnevoldsen
Contributor

Reference to the model implementation PR:
embeddings-benchmark/mteb#2727

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented May 26, 2025

Congratulations on the release, and a great paper; I will definitely give it a closer read!

Here is an overview of the results compared to existing models:

| task_name | intfloat/e5-base-v2 | intfloat/e5-large-v2 | intfloat/multilingual-e5-large-instruct | manveertamber/cadet-embed-base-v1 |
|---|---|---|---|---|
| AILACasedocs | 0.27 | 0.31 | 0.33 | 0.30 |
| AILAStatutes | 0.20 | 0.18 | 0.30 | 0.24 |
| ARCChallenge | 0.10 | 0.11 | 0.15 | 0.12 |
| AlphaNLI | 0.22 | 0.15 | 0.25 | 0.32 |
| AppsRetrieval | 0.12 | 0.14 | 0.35 | 0.09 |
| ArguAna | 0.45 | 0.46 | 0.58 | 0.56 |
| BrightLongRetrieval | nan | nan | nan | 0.34 |
| BrightRetrieval | nan | nan | nan | 0.15 |
| BuiltBenchRetrieval | 0.61 | 0.65 | 0.65 | 0.65 |
| CQADupstackAndroidRetrieval | 0.50 | 0.50 | 0.55 | 0.48 |
| CQADupstackEnglishRetrieval | 0.42 | 0.44 | 0.49 | 0.49 |
| CQADupstackGamingRetrieval | 0.56 | 0.58 | 0.64 | 0.58 |
| CQADupstackGisRetrieval | 0.35 | 0.36 | 0.41 | 0.39 |
| CQADupstackMathematicaRetrieval | 0.28 | 0.29 | 0.31 | 0.30 |
| CQADupstackPhysicsRetrieval | 0.41 | 0.39 | 0.49 | 0.44 |
| CQADupstackProgrammersRetrieval | 0.37 | 0.38 | 0.47 | 0.41 |
| CQADupstackRetrieval | 0.39 | 0.38 | 0.44 | 0.41 |
| CQADupstackStatsRetrieval | 0.33 | 0.33 | 0.39 | 0.35 |
| CQADupstackTexRetrieval | 0.27 | 0.27 | 0.32 | 0.30 |
| CQADupstackUnixRetrieval | 0.37 | 0.39 | 0.45 | 0.40 |
| CQADupstackWebmastersRetrieval | 0.38 | 0.38 | 0.45 | 0.40 |
| CQADupstackWordpressRetrieval | 0.31 | 0.32 | 0.35 | 0.34 |
| ChemHotpotQARetrieval | 0.83 | 0.84 | nan | 0.87 |
| ChemNQRetrieval | 0.62 | 0.69 | nan | 0.67 |
| ClimateFEVER | 0.27 | 0.22 | 0.30 | 0.37 |
| ClimateFEVER.v2 | nan | nan | nan | 0.30 |
| ClimateFEVERHardNegatives | 0.27 | 0.23 | 0.24 | 0.37 |
| CodeFeedbackMT | 0.42 | 0.48 | 0.40 | 0.45 |
| CodeFeedbackST | 0.75 | 0.76 | 0.76 | 0.75 |
| CosQA | 0.33 | 0.32 | 0.38 | 0.32 |
| DBPedia | 0.42 | 0.44 | 0.38 | 0.45 |
| DBPediaHardNegatives | nan | nan | 0.38 | 0.48 |
| FEVER | 0.85 | 0.83 | 0.78 | 0.89 |
| FEVERHardNegatives | 0.85 | 0.83 | 0.76 | 0.89 |
| FaithDial | nan | nan | 0.24 | 0.23 |
| FeedbackQARetrieval | nan | nan | 0.55 | 0.56 |
| FiQA2018 | 0.40 | 0.41 | 0.48 | 0.41 |
| HagridRetrieval | 0.99 | 0.99 | 0.99 | 0.99 |
| HellaSwag | 0.25 | 0.28 | 0.32 | 0.30 |
| HotpotQA | 0.69 | 0.73 | 0.69 | 0.74 |
| HotpotQAHardNegatives | 0.69 | 0.73 | 0.65 | 0.74 |
| LEMBNarrativeQARetrieval | 0.25 | 0.26 | 0.27 | 0.25 |
| LEMBNeedleRetrieval | 0.29 | 0.32 | 0.29 | 0.28 |
| LEMBPasskeyRetrieval | 0.38 | 0.39 | 0.38 | 0.38 |
| LEMBQMSumRetrieval | 0.24 | 0.25 | 0.26 | 0.26 |
| LEMBSummScreenFDRetrieval | 0.75 | 0.77 | 0.73 | 0.77 |
| LEMBWikimQARetrieval | 0.56 | 0.58 | 0.58 | 0.58 |
| LegalBenchConsumerContractsQA | 0.72 | 0.77 | 0.77 | 0.77 |
| LegalBenchCorporateLobbying | 0.92 | 0.91 | 0.94 | 0.93 |
| LegalSummarization | 0.59 | 0.60 | 0.68 | 0.64 |
| LitSearchRetrieval | nan | nan | nan | 0.48 |
| MLQuestions | nan | nan | 0.60 | 0.63 |
| MSMARCO | 0.42 | 0.43 | 0.40 | 0.43 |
| MSMARCOHardNegatives | nan | nan | 0.67 | 0.73 |
| MedicalQARetrieval | 0.69 | 0.70 | 0.71 | 0.71 |
| NFCorpus | 0.35 | 0.37 | 0.36 | 0.38 |
| NQ | 0.58 | 0.63 | 0.58 | 0.59 |
| QuoraRetrieval | 0.87 | 0.87 | 0.89 | 0.88 |
| SCIDOCS | 0.19 | 0.20 | 0.19 | 0.19 |
| SciFact | 0.72 | 0.72 | 0.72 | 0.75 |
| StackOverflowQA | 0.88 | 0.90 | 0.86 | 0.86 |
| SyntheticText2SQL | 0.52 | 0.50 | 0.59 | 0.56 |
| TRECCOVID | 0.70 | 0.67 | 0.83 | 0.81 |
| Touche2020 | 0.26 | 0.21 | 0.27 | 0.30 |
| Average | 0.48 | 0.49 | 0.50 | 0.50 |

Overall I don't see too many issues.
The FEVER scores look slightly inflated given the model's size, which is likely due to training on FEVER-derived data.

@manveertamber it seems there are still quite a few scores missing from the main English leaderboard (MTEB(eng, v2)) introduced in MMTEB. Do you want to add these as well? (Just a recommendation; we can also merge as is.) A rough sketch for running the missing tasks is included after the table below.

| task_name | intfloat/e5-base-v2 | intfloat/e5-large-v2 | intfloat/multilingual-e5-large-instruct | manveertamber/cadet-embed-base-v1 |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 0.76 | 0.78 | 0.70 | nan |
| ArXivHierarchicalClusteringP2P | 0.58 | 0.58 | 0.63 | nan |
| ArXivHierarchicalClusteringS2S | 0.55 | 0.55 | 0.61 | nan |
| ArguAna | 0.45 | 0.46 | 0.58 | 0.56 |
| AskUbuntuDupQuestions | 0.59 | 0.60 | 0.64 | nan |
| BIOSSES | 0.81 | 0.84 | 0.87 | nan |
| Banking77Classification | 0.84 | 0.85 | 0.78 | nan |
| BiorxivClusteringP2P.v2 | 0.39 | 0.40 | 0.43 | nan |
| CQADupstackGamingRetrieval | 0.56 | 0.58 | 0.64 | 0.58 |
| CQADupstackUnixRetrieval | 0.37 | 0.39 | 0.45 | 0.40 |
| ClimateFEVERHardNegatives | 0.27 | 0.23 | 0.24 | 0.37 |
| FEVERHardNegatives | 0.85 | 0.83 | 0.76 | 0.89 |
| FiQA2018 | 0.40 | 0.41 | 0.48 | 0.41 |
| HotpotQAHardNegatives | 0.69 | 0.73 | 0.65 | 0.74 |
| ImdbClassification | 0.86 | 0.92 | 0.95 | nan |
| MTOPDomainClassification | 0.92 | 0.93 | 0.91 | nan |
| MassiveIntentClassification | 0.67 | 0.68 | 0.71 | nan |
| MassiveScenarioClassification | 0.73 | 0.71 | 0.74 | nan |
| MedrxivClusteringP2P.v2 | 0.36 | 0.35 | 0.38 | nan |
| MedrxivClusteringS2S.v2 | 0.33 | 0.34 | 0.38 | nan |
| MindSmallReranking | 0.31 | 0.32 | 0.33 | nan |
| SCIDOCS | 0.19 | 0.20 | 0.19 | 0.19 |
| SICK-R | 0.78 | 0.79 | 0.82 | nan |
| STS12 | 0.73 | 0.74 | 0.83 | nan |
| STS13 | 0.83 | 0.81 | 0.88 | nan |
| STS14 | 0.80 | 0.79 | 0.85 | nan |
| STS15 | 0.88 | 0.88 | 0.91 | nan |
| STS17 | 0.89 | 0.90 | 0.90 | nan |
| STS22.v2 | 0.67 | 0.67 | 0.68 | nan |
| STSBenchmark | 0.85 | 0.85 | 0.88 | nan |
| SprintDuplicateQuestions | 0.94 | 0.95 | 0.92 | nan |
| StackExchangeClustering.v2 | 0.53 | 0.52 | 0.60 | nan |
| StackExchangeClusteringP2P.v2 | 0.40 | 0.40 | 0.46 | nan |
| SummEvalSummarization.v2 | 0.34 | 0.32 | 0.30 | nan |
| TRECCOVID | 0.70 | 0.67 | 0.83 | 0.81 |
| Touche2020Retrieval.v3 | 0.50 | 0.42 | 0.53 | nan |
| ToxicConversationsClassification | 0.66 | 0.63 | 0.67 | nan |
| TweetSentimentExtractionClassification | 0.60 | 0.61 | 0.59 | nan |
| TwentyNewsgroupsClustering.v2 | 0.48 | 0.48 | 0.51 | nan |
| TwitterSemEval2015 | 0.76 | 0.77 | 0.80 | nan |
| TwitterURLCorpus | 0.87 | 0.86 | 0.87 | nan |
| Average | 0.63 | 0.63 | 0.66 | 0.55 |
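
Running the missing MTEB(eng, v2) tasks could look roughly like the sketch below. This is a minimal sketch, not the reference implementation: it assumes the model is registered in mteb's model registry under the name used here (via the reference implementation PR) and that the installed mteb version exposes the `get_model` and `get_benchmark` helpers.

```python
import mteb

# Sketch only: assumes the model name is registered in mteb/models/
# (via the reference implementation in embeddings-benchmark/mteb#2727).
model = mteb.get_model("manveertamber/cadet-embed-base-v1")

# The main English benchmark introduced with MMTEB.
benchmark = mteb.get_benchmark("MTEB(eng, v2)")

# Run the benchmark tasks; pass overwrite_results=True to evaluation.run
# if scores already present in the output folder should be recomputed.
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results")
```

The JSON files written to the output folder could then be submitted to this results repository in the same way as the scores already included in this PR.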

@manveertamber
Contributor Author

Hi @KennethEnevoldsen, thanks for the kind words!

Can we merge for now? At the moment I'm not too concerned with these additional experiments and the model isn't fine-tuned for clustering/classification in particular anyway.

KennethEnevoldsen merged commit c0c7ead into embeddings-benchmark:main on May 29, 2025
2 checks passed
@manveertamber
Contributor Author

Hi @KennethEnevoldsen, were these results files deleted? I can't seem to find them anymore and I'm not sure what happened.

manveertamber mentioned this pull request on Jun 10, 2025