add Seed-1.6-embedding model results #223
Conversation
Model Results Comparison
Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large
Results for Bytedance/Seed-1.6-embedding:
| task_name | Bytedance/Seed-1.6-embedding | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AFQMC | 0.68 | nan | 0.33 | 0.01 |
| ATEC | 0.61 | nan | 0.4 | 0.01 |
| AmazonCounterfactualClassification | 0.94 | 0.88 | 0.7 | 0.92 |
| ArXivHierarchicalClusteringP2P | 0.63 | 0.65 | 0.56 | 0.01 |
| ArXivHierarchicalClusteringS2S | 0.62 | 0.64 | 0.54 | 0.01 |
| ArguAna | 0.71 | 0.86 | 0.54 | 0.64 |
| AskUbuntuDupQuestions | 0.68 | 0.64 | 0.59 | 0.66 |
| BIOSSES | 0.87 | 0.89 | 0.85 | 0.83 |
| BQ | 0.73 | nan | 0.48 | 0.01 |
| Banking77Classification | 0.92 | 0.94 | 0.75 | 0.89 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 | 0.01 |
| CLSClusteringP2P | 0.82 | nan | nan | 0.01 |
| CLSClusteringS2S | 0.74 | nan | nan | 0.01 |
| CMedQAv1-reranking | 0.89 | nan | 0.68 | 0.01 |
| CMedQAv2-reranking | 0.90 | nan | 0.67 | 0.01 |
| CQADupstackGamingRetrieval | 0.67 | 0.71 | 0.59 | 0.01 |
| CQADupstackUnixRetrieval | 0.55 | 0.54 | 0.4 | 0.01 |
| ClimateFEVERHardNegatives | 0.47 | 0.31 | 0.26 | 0.00 |
| CmedqaRetrieval | 0.49 | nan | 0.29 | 0.01 |
| Cmnli | 0.90 | nan | nan | 0.01 |
| CovidRetrieval | 0.88 | 0.79 | 0.76 | 0.01 |
| DuRetrieval | 0.94 | nan | 0.85 | 0.01 |
| EcomRetrieval | 0.74 | nan | 0.55 | 0.01 |
| FEVERHardNegatives | 0.93 | 0.89 | 0.84 | 0.01 |
| FiQA2018 | 0.62 | 0.62 | 0.44 | 0.56 |
| HotpotQAHardNegatives | 0.84 | 0.87 | 0.71 | 0.01 |
| IFlyTek | 0.51 | nan | 0.42 | 0.01 |
| ImdbClassification | 0.97 | 0.95 | 0.89 | 0.96 |
| JDReview | 0.91 | nan | 0.81 | 0.01 |
| LCQMC | 0.80 | nan | 0.76 | 0.01 |
| MMarcoReranking | 0.40 | nan | 0.29 | 0.00 |
| MMarcoRetrieval | 0.90 | nan | 0.79 | 0.01 |
| MTOPDomainClassification | 0.99 | 0.98 | 0.9 | 0.99 |
| MassiveIntentClassification | 0.89 | 0.82 | 0.6 | 0.85 |
| MassiveScenarioClassification | 0.93 | 0.87 | 0.7 | 0.90 |
| MedicalRetrieval | 0.72 | nan | 0.51 | 0.01 |
| MedrxivClusteringP2P.v2 | 0.49 | 0.47 | 0.34 | 0.01 |
| MedrxivClusteringS2S.v2 | 0.47 | 0.45 | 0.32 | 0.01 |
| MindSmallReranking | 0.33 | 0.33 | 0.3 | 0.33 |
| MultilingualSentiment | 0.81 | nan | 0.71 | 0.01 |
| Ocnli | 0.88 | nan | nan | 0.01 |
| OnlineShopping | 0.95 | nan | 0.9 | 0.01 |
| PAWSX | 0.59 | nan | 0.15 | 0.01 |
| QBQTC | 0.57 | nan | nan | 0.01 |
| SCIDOCS | 0.25 | 0.25 | 0.17 | 0.25 |
| SICK-R | 0.85 | 0.83 | 0.8 | 0.82 |
| STS12 | 0.83 | 0.82 | 0.8 | 0.80 |
| STS13 | 0.92 | 0.90 | 0.82 | 0.89 |
| STS14 | 0.89 | 0.85 | 0.78 | 0.85 |
| STS15 | 0.92 | 0.90 | 0.89 | 0.89 |
| STS17 | 0.91 | 0.89 | 0.82 | 0.91 |
| STS22.v2 | 0.73 | 0.72 | 0.64 | 0.01 |
| STSB | 0.85 | 0.85 | 0.82 | 0.01 |
| STSBenchmark | 0.89 | 0.89 | 0.87 | 0.88 |
| SprintDuplicateQuestions | 0.90 | 0.97 | 0.93 | 0.96 |
| StackExchangeClustering.v2 | 0.80 | 0.92 | 0.46 | 0.01 |
| StackExchangeClusteringP2P.v2 | 0.52 | 0.51 | 0.39 | 0.01 |
| SummEvalSummarization.v2 | 0.37 | 0.38 | 0.31 | 0.00 |
| T2Reranking | 0.68 | 0.68 | 0.66 | 0.01 |
| T2Retrieval | 0.89 | nan | 0.76 | 0.01 |
| TNews | 0.59 | nan | 0.49 | 0.01 |
| TRECCOVID | 0.84 | 0.86 | 0.71 | 0.85 |
| ThuNewsClusteringP2P | 0.69 | nan | nan | 0.01 |
| ThuNewsClusteringS2S | 0.68 | nan | nan | 0.01 |
| Touche2020Retrieval.v3 | 0.61 | 0.52 | 0.5 | 0.01 |
| ToxicConversationsClassification | 0.94 | 0.89 | 0.66 | 0.87 |
| TweetSentimentExtractionClassification | 0.80 | 0.70 | 0.63 | 0.74 |
| TwentyNewsgroupsClustering.v2 | 0.67 | 0.57 | 0.39 | 0.01 |
| TwitterSemEval2015 | 0.78 | 0.79 | 0.75 | 0.80 |
| TwitterURLCorpus | 0.87 | 0.87 | 0.86 | 0.87 |
| VideoRetrieval | 0.81 | nan | 0.58 | 0.01 |
| Waimai | 0.91 | nan | 0.86 | 0.01 |
| Average | 0.75 | 0.73 | 0.61 | 0.28 |
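For context, per-task scores like those in the table are typically produced with the mteb library. A minimal sketch, assuming a recent mteb version and that the model is loadable via mteb.get_model; the task subset here is illustrative, not the full list above:

```python
# Minimal sketch: producing per-task scores of the kind shown in the table.
import mteb

model = mteb.get_model("Bytedance/Seed-1.6-embedding")
tasks = mteb.get_tasks(tasks=["STS12", "SCIDOCS", "Banking77Classification"])
evaluation = mteb.MTEB(tasks=tasks)

# Scores are written as JSON files under the output folder, which is the
# format the results repository ingests.
results = evaluation.run(model, output_folder="results")
for task_result in results:
    print(task_result.task_name, task_result.get_score())
```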
@KennethEnevoldsen I noticed that a "review of implementation" tag has been added to this PR. May I ask how long this usually takes, and what other steps are needed before these results can be merged? If you need any extra information or have any questions, please let me know at any time.
@QuanYuhan, it simply means that we are waiting for a review of the implementation (changed the name to match). I just merged the implementation, so I will take a look at this now.
@QuanYuhan, some of these scores look like the data might have been included during training (some of them are comparable to models trained on the training set). Can you double-check the potentially problematic ones?
@KennethEnevoldsen Thank you for the reminder. After a careful check, we found that there were omissions in the declared training data. For the 5 datasets you mentioned, we have submitted a PR (embeddings-benchmark/mteb#2857) to update the model information. If you have any other questions, feel free to contact me at any time.
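The kind of metadata fix referenced above is, assuming a recent mteb version, a `training_datasets` mapping on the model's ModelMeta that records which benchmark datasets (and splits) were seen during training. A sketch with placeholder entries; the dataset names are hypothetical, not the actual five from the linked PR:

```python
# Hypothetical entries only; the real list lives in embeddings-benchmark/mteb#2857.
training_datasets = {
    "DuRetrieval": ["train"],      # hypothetical: the train split was seen
    "MMarcoRetrieval": ["train"],  # hypothetical
}
# In mteb/models/, this mapping would be passed to the model's ModelMeta,
# e.g. ModelMeta(..., training_datasets=training_datasets), so leaderboard
# tooling can flag or exclude scores on leaked datasets.
```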
@KennethEnevoldsen Please have a look. Can the results be merged? Thank you. |
Checklist
The model implementation has been added to mteb/models/ (this can be as an API). Instructions on how to add a model can be found here.
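The "as an API" option refers to implementations that call a remote embedding endpoint instead of loading weights locally. A minimal sketch of such a wrapper, where `call_seed_embedding_api` is a hypothetical placeholder for the vendor's HTTP client; mteb's evaluators only require an object exposing `encode` that returns a 2-D array:

```python
import numpy as np


def call_seed_embedding_api(texts: list[str]) -> list[list[float]]:
    """Hypothetical placeholder for the vendor's embedding endpoint."""
    raise NotImplementedError("wire up the real API client here")


class SeedEmbeddingAPIModel:
    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        # Send the batch to the remote API and return a 2-D float array,
        # the shape mteb's evaluators expect.
        return np.asarray(call_seed_embedding_api(sentences), dtype=np.float32)
```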