
Conversation

@QuanYuhan (Contributor)

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/; this can be an API-based implementation. Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation (a sketch of the typical evaluation flow follows this list)
  • My model is available, either as a publicly accessible API or publicly on e.g. Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.
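
For context, results like the table below are produced by running the model's reference implementation through the mteb evaluation harness. A minimal sketch of that flow, using an illustrative model and task selection rather than the exact command used for this submission:

```python
# Minimal sketch of producing MTEB results from a registered reference
# implementation. Model and task names here are illustrative.
import mteb

# Load the reference implementation registered under mteb/models/.
model = mteb.get_model("intfloat/multilingual-e5-large")

# Select tasks to evaluate (a real submission covers many more).
tasks = mteb.get_tasks(tasks=["STS12", "Banking77Classification"])

evaluation = mteb.MTEB(tasks=tasks)

# Scores are written as JSON files under the output folder; those files
# are what get submitted to the results repository.
results = evaluation.run(model, output_folder="results")
```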

@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: Bytedance/Seed-1.6-embedding
Tasks: AFQMC, ATEC, AmazonCounterfactualClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, BQ, Banking77Classification, BiorxivClusteringP2P.v2, CLSClusteringP2P, CLSClusteringS2S, CMedQAv1-reranking, CMedQAv2-reranking, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, ClimateFEVERHardNegatives, CmedqaRetrieval, Cmnli, CovidRetrieval, DuRetrieval, EcomRetrieval, FEVERHardNegatives, FiQA2018, HotpotQAHardNegatives, IFlyTek, ImdbClassification, JDReview, LCQMC, MMarcoReranking, MMarcoRetrieval, MTOPDomainClassification, MassiveIntentClassification, MassiveScenarioClassification, MedicalRetrieval, MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultilingualSentiment, Ocnli, OnlineShopping, PAWSX, QBQTC, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, SummEvalSummarization.v2, T2Reranking, T2Retrieval, TNews, TRECCOVID, ThuNewsClusteringP2P, ThuNewsClusteringS2S, Touche2020Retrieval.v3, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering.v2, TwitterSemEval2015, TwitterURLCorpus, VideoRetrieval, Waimai

Results for Bytedance/Seed-1.6-embedding

| task_name | Bytedance/Seed-1.6-embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.68 | nan | 0.33 | 0.01 |
| ATEC | 0.61 | nan | 0.4 | 0.01 |
| AmazonCounterfactualClassification | 0.94 | 0.88 | 0.7 | 0.92 |
| ArXivHierarchicalClusteringP2P | 0.63 | 0.65 | 0.56 | 0.01 |
| ArXivHierarchicalClusteringS2S | 0.62 | 0.64 | 0.54 | 0.01 |
| ArguAna | 0.71 | 0.86 | 0.54 | 0.64 |
| AskUbuntuDupQuestions | 0.68 | 0.64 | 0.59 | 0.66 |
| BIOSSES | 0.87 | 0.89 | 0.85 | 0.83 |
| BQ | 0.73 | nan | 0.48 | 0.01 |
| Banking77Classification | 0.92 | 0.94 | 0.75 | 0.89 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.01 |
| CLSClusteringP2P | 0.82 | nan | nan | 0.01 |
| CLSClusteringS2S | 0.74 | nan | nan | 0.01 |
| CMedQAv1-reranking | 0.89 | nan | 0.68 | 0.01 |
| CMedQAv2-reranking | 0.90 | nan | 0.67 | 0.01 |
| CQADupstackGamingRetrieval | 0.67 | 0.71 | 0.59 | 0.01 |
| CQADupstackUnixRetrieval | 0.55 | 0.54 | 0.4 | 0.01 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.26 | 0.00 |
| CmedqaRetrieval | 0.49 | nan | 0.29 | 0.01 |
| Cmnli | 0.90 | nan | nan | 0.01 |
| CovidRetrieval | 0.88 | 0.79 | 0.76 | 0.01 |
| DuRetrieval | 0.94 | nan | 0.85 | 0.01 |
| EcomRetrieval | 0.74 | nan | 0.55 | 0.01 |
| FEVERHardNegatives | 0.93 | 0.89 | 0.84 | 0.01 |
| FiQA2018 | 0.62 | 0.62 | 0.44 | 0.56 |
| HotpotQAHardNegatives | 0.84 | 0.87 | 0.71 | 0.01 |
| IFlyTek | 0.51 | nan | 0.42 | 0.01 |
| ImdbClassification | 0.97 | 0.95 | 0.89 | 0.96 |
| JDReview | 0.91 | nan | 0.81 | 0.01 |
| LCQMC | 0.80 | nan | 0.76 | 0.01 |
| MMarcoReranking | 0.40 | nan | 0.29 | 0.00 |
| MMarcoRetrieval | 0.90 | nan | 0.79 | 0.01 |
| MTOPDomainClassification | 0.99 | 0.98 | 0.9 | 0.99 |
| MassiveIntentClassification | 0.89 | 0.82 | 0.6 | 0.85 |
| MassiveScenarioClassification | 0.93 | 0.87 | 0.7 | 0.90 |
| MedicalRetrieval | 0.72 | nan | 0.51 | 0.01 |
| MedrxivClusteringP2P.v2 | 0.49 | 0.47 | 0.34 | 0.01 |
| MedrxivClusteringS2S.v2 | 0.47 | 0.45 | 0.32 | 0.01 |
| MindSmallReranking | 0.33 | 0.33 | 0.3 | 0.33 |
| MultilingualSentiment | 0.81 | nan | 0.71 | 0.01 |
| Ocnli | 0.88 | nan | nan | 0.01 |
| OnlineShopping | 0.95 | nan | 0.9 | 0.01 |
| PAWSX | 0.59 | nan | 0.15 | 0.01 |
| QBQTC | 0.57 | nan | nan | 0.01 |
| SCIDOCS | 0.25 | 0.25 | 0.17 | 0.25 |
| SICK-R | 0.85 | 0.83 | 0.8 | 0.82 |
| STS12 | 0.83 | 0.82 | 0.8 | 0.80 |
| STS13 | 0.92 | 0.90 | 0.82 | 0.89 |
| STS14 | 0.89 | 0.85 | 0.78 | 0.85 |
| STS15 | 0.92 | 0.90 | 0.89 | 0.89 |
| STS17 | 0.91 | 0.89 | 0.82 | 0.91 |
| STS22.v2 | 0.73 | 0.72 | 0.64 | 0.01 |
| STSB | 0.85 | 0.85 | 0.82 | 0.01 |
| STSBenchmark | 0.89 | 0.89 | 0.87 | 0.88 |
| SprintDuplicateQuestions | 0.90 | 0.97 | 0.93 | 0.96 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.46 | 0.01 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.39 | 0.01 |
| SummEvalSummarization.v2 | 0.37 | 0.38 | 0.31 | 0.00 |
| T2Reranking | 0.68 | 0.68 | 0.66 | 0.01 |
| T2Retrieval | 0.89 | nan | 0.76 | 0.01 |
| TNews | 0.59 | nan | 0.49 | 0.01 |
| TRECCOVID | 0.84 | 0.86 | 0.71 | 0.85 |
| ThuNewsClusteringP2P | 0.69 | nan | nan | 0.01 |
| ThuNewsClusteringS2S | 0.68 | nan | nan | 0.01 |
| Touche2020Retrieval.v3 | 0.61 | 0.52 | 0.5 | 0.01 |
| ToxicConversationsClassification | 0.94 | 0.89 | 0.66 | 0.87 |
| TweetSentimentExtractionClassification | 0.80 | 0.70 | 0.63 | 0.74 |
| TwentyNewsgroupsClustering.v2 | 0.67 | 0.57 | 0.39 | 0.01 |
| TwitterSemEval2015 | 0.78 | 0.79 | 0.75 | 0.80 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.87 |
| VideoRetrieval | 0.81 | nan | 0.58 | 0.01 |
| Waimai | 0.91 | nan | 0.86 | 0.01 |
| Average | 0.75 | 0.73 | 0.61 | 0.28 |
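
A note on the Average row: not every model was evaluated on every task (the nan entries), so each model's average is presumably taken only over its available scores. A hypothetical sketch of that computation; the scores dict is illustrative, not the actual comparison pipeline:

```python
# Average one model's column while skipping missing (nan) scores.
import math

scores = {  # task_name -> score; nan marks "not evaluated"
    "AFQMC": 0.68,
    "ArguAna": 0.71,
    "CLSClusteringP2P": float("nan"),
}

valid = [s for s in scores.values() if not math.isnan(s)]
average = sum(valid) / len(valid) if valid else float("nan")
print(f"Average over {len(valid)} tasks: {average:.2f}")
```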

@KennethEnevoldsen added the `waiting for review of implementation` label ("This PR is waiting for an implementation review before merging the results.") on Jun 24, 2025
@QuanYuhan (Contributor, Author)

@KennethEnevoldsen I noticed that a "waiting for review of implementation" label has been added to this PR. May I ask how long this usually takes, and what other steps are needed before these results can be merged? If you need any extra information or have any questions, please let me know at any time.

@KennethEnevoldsen removed the `waiting for review of implementation` label on Jun 25, 2025
@KennethEnevoldsen (Contributor)

@QuanYuhan, it simply means that we are waiting for a review of the implementation (I have changed the label name to match).

I just merged the implementation, so I will take a look at this now.

@KennethEnevoldsen (Contributor)

Potentially problematic:

  • TwentyNewsgroupsClustering.v2
  • AmazonCounterfactualClassification
  • ClimateFEVERHardNegatives
  • TweetSentimentExtractionClassification
  • MassiveScenarioClassification

@QuanYuhan, some of these scores look like the data might have been included during training (some of them are comparable to models trained on the training set). Can you double-check these?

@QuanYuhan (Contributor, Author)

QuanYuhan commented Jun 25, 2025

> Potentially problematic:
>
>   • TwentyNewsgroupsClustering.v2
>   • AmazonCounterfactualClassification
>   • ClimateFEVERHardNegatives
>   • TweetSentimentExtractionClassification
>   • MassiveScenarioClassification
>
> @QuanYuhan, some of these scores look like the data might have been included during training (some of them are comparable to models trained on the training set). Can you double-check these?

@KennethEnevoldsen Thank you for the reminder.

After a careful check, we found that there were omissions in the training data we had reported. Of the 5 datasets you mentioned, we did not use the ClimateFEVERHardNegatives training set; for the other datasets, we did use the training sets. We have also filled in some other missing training-set information.

We have submitted a PR (embeddings-benchmark/mteb#2857) to update the model information. If you have any other questions, feel free to contact me at any time.
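
For readers unfamiliar with how such disclosures are recorded: mteb tracks, per model, which evaluation datasets' training splits were used. A hypothetical sketch of the kind of mapping being added in the linked PR; the exact schema and task names in embeddings-benchmark/mteb#2857 may differ:

```python
# Hypothetical training-data disclosure: MTEB task name -> splits used.
training_datasets = {
    "TwentyNewsgroupsClustering.v2": ["train"],
    "AmazonCounterfactualClassification": ["train"],
    "TweetSentimentExtractionClassification": ["train"],
    "MassiveScenarioClassification": ["train"],
    # ClimateFEVERHardNegatives is deliberately absent: per the comment
    # above, its training set was not used.
}
```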

@QuanYuhan (Contributor, Author)

@KennethEnevoldsen Please have a look. Can the results be merged? Thank you.

@Samoed merged commit 94040ba into embeddings-benchmark:main on Jun 27, 2025. 3 checks passed.