Update LGAI-Embedding results #219
Conversation
Submit results
Seems like these weren't run with the submitted implementation? (at least the metadata does not match)
Results for annamodels/LGAI-Embedding-Preview
| task_name | annamodels/LGAI-Embedding-Preview | google/gemini-embedding-001 | intfloat/multilingual-e5-large |
|---|---|---|---|
| AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 |
| ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 |
| ArguAna | 0.87 | 0.86 | 0.54 |
| AskUbuntuDupQuestions | 0.66 | 0.64 | 0.59 |
| BIOSSES | 0.86 | 0.89 | 0.85 |
| Banking77Classification | 0.91 | 0.94 | 0.75 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.59 |
| CQADupstackUnixRetrieval | 0.56 | 0.54 | 0.4 |
| ClimateFEVERHardNegatives | 0.42 | 0.31 | 0.26 |
| FEVERHardNegatives | 0.93 | 0.89 | 0.84 |
| FiQA2018 | 0.61 | 0.62 | 0.44 |
| HotpotQAHardNegatives | 0.76 | 0.87 | 0.71 |
| ImdbClassification | 0.97 | 0.95 | 0.89 |
| MTOPDomainClassification | 0.98 | 0.98 | 0.9 |
| MassiveIntentClassification | 0.82 | 0.82 | 0.6 |
| MassiveScenarioClassification | 0.85 | 0.87 | 0.7 |
| MedrxivClusteringP2P.v2 | 0.47 | 0.47 | 0.34 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 |
| MindSmallReranking | 0.33 | 0.33 | 0.3 |
| SCIDOCS | 0.27 | 0.25 | 0.17 |
| SICK-R | 0.85 | 0.83 | 0.8 |
| STS12 | 0.82 | 0.82 | 0.8 |
| STS13 | 0.90 | 0.90 | 0.82 |
| STS14 | 0.88 | 0.85 | 0.78 |
| STS15 | 0.92 | 0.90 | 0.89 |
| STS17 | 0.90 | 0.89 | 0.82 |
| STS22.v2 | 0.75 | 0.72 | 0.64 |
| STSBenchmark | 0.91 | 0.89 | 0.87 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.93 |
| StackExchangeClustering.v2 | 0.79 | 0.92 | 0.46 |
| StackExchangeClusteringP2P.v2 | 0.49 | 0.51 | 0.39 |
| SummEvalSummarization.v2 | 0.39 | 0.38 | 0.31 |
| TRECCOVID | 0.90 | 0.86 | 0.71 |
| Touche2020Retrieval.v3 | 0.59 | 0.52 | 0.5 |
| ToxicConversationsClassification | 0.93 | 0.89 | 0.66 |
| TweetSentimentExtractionClassification | 0.80 | 0.70 | 0.63 |
| TwentyNewsgroupsClustering.v2 | 0.68 | 0.57 | 0.39 |
| TwitterSemEval2015 | 0.80 | 0.79 | 0.75 |
| TwitterURLCorpus | 0.88 | 0.87 | 0.86 |
| Average | 0.74 | 0.73 | 0.62 |
| task_name | ByteDance-Seed/Seed1.5-Embedding | annamodels/LGAI-Embedding-Preview |
|---|---|---|
| AmazonCounterfactualClassification | 0.91 | 0.93 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.66 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 |
| ArguAna | 0.75 | 0.87 |
| AskUbuntuDupQuestions | 0.69 | 0.66 |
| BIOSSES | 0.83 | 0.86 |
| Banking77Classification | 0.92 | 0.91 |
| BiorxivClusteringP2P.v2 | 0.56 | 0.54 |
| CQADupstackGamingRetrieval | 0.72 | 0.70 |
| CQADupstackUnixRetrieval | 0.57 | 0.56 |
| ClimateFEVERHardNegatives | 0.48 | 0.42 |
| FEVERHardNegatives | 0.95 | 0.93 |
| FiQA2018 | 0.65 | 0.61 |
| HotpotQAHardNegatives | 0.86 | 0.76 |
| ImdbClassification | 0.97 | 0.97 |
| MTOPDomainClassification | 0.99 | 0.98 |
| MassiveIntentClassification | 0.87 | 0.82 |
| MassiveScenarioClassification | 0.93 | 0.85 |
| MedrxivClusteringP2P.v2 | 0.52 | 0.47 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.48 |
| MindSmallReranking | 0.33 | 0.33 |
| SCIDOCS | 0.26 | 0.27 |
| SICK-R | 0.84 | 0.85 |
| STS12 | 0.85 | 0.82 |
| STS13 | 0.93 | 0.90 |
| STS14 | 0.90 | 0.88 |
| STS15 | 0.92 | 0.92 |
| STS17 | 0.93 | 0.90 |
| STS22.v2 | 0.73 | 0.75 |
| STSBenchmark | 0.92 | 0.91 |
| SprintDuplicateQuestions | 0.97 | 0.97 |
| StackExchangeClustering.v2 | 0.81 | 0.79 |
| StackExchangeClusteringP2P.v2 | 0.53 | 0.49 |
| SummEvalSummarization.v2 | 0.36 | 0.39 |
| TRECCOVID | 0.88 | 0.90 |
| Touche2020Retrieval.v3 | 0.64 | 0.59 |
| ToxicConversationsClassification | 0.87 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.80 |
| TwentyNewsgroupsClustering.v2 | 0.65 | 0.68 |
| TwitterSemEval2015 | 0.78 | 0.80 |
| TwitterURLCorpus | 0.87 | 0.88 |
| Average | 0.75 | 0.74 |
All of the suspiciously high scores are due to them being non-zero-shot.
@KennethEnevoldsen Thanks for your review.
If there's anything we need to adjust to have our model listed on the leaderboard, could you kindly provide a clear guide? We'd be happy to revise accordingly based on your instructions.
Sorry, this wasn't clear. It seems like these weren't obtained using the submitted implementation. At least the metadata file does not match. Simply rerunning it with the submitted implementation should solve this.
@KennethEnevoldsen Thanks for your response.
Thanks for the confirmation. The problem is that the submitted model_meta.json does not match the submitted implementation. It might be that you hadn't added all the metadata beforehand. If that is the case, you can just recreate the model_meta.json by rerunning the evaluation with the submitted implementation.
@KennethEnevoldsen Thanks so much for your kind response — I really appreciate it. During inference, we simply evaluated the model locally without specifying any metadata, which is likely why model_meta.json ended up with missing fields like reference. Even when we load our model from Hugging Face for evaluation, the reference field in model_meta.json still appears as null. Would it be possible for you to kindly guide us on the correct way to input or configure the model metadata? We'd be very grateful for any instructions or examples you could share.
You should be able to do it using:

```python
import mteb

meta = mteb.get_model_meta(name)
model = meta.load_model()  # load the model using the specified implementation

# evaluate with mteb
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, ...)
```
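For a concrete run, a minimal sketch of the same flow might look like the following; the model name comes from this PR, while the task selection and the output_folder argument are illustrative assumptions about a typical mteb setup:

```python
import mteb

# Resolve the registered metadata and load the model through the mteb
# implementation, so the generated model_meta.json matches the repository.
meta = mteb.get_model_meta("annamodels/LGAI-Embedding-Preview")
model = meta.load_model()

# Any subset of the benchmark works here; two tasks are picked as an example.
tasks = mteb.get_tasks(tasks=["STS12", "Banking77Classification"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # writes scores and model_meta.json
```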
Following your instructions, I've found that the model_meta.json file now correctly aligns with the ModelMeta information. Just to clarify, the reason the initial model_meta.json we submitted had null values for some fields was that we hadn't yet submitted the model to GitHub. At that time, we were running inference using our own evaluation script on a local directory where the model was stored, which is why some metadata fields were left as null. Now that the metadata is correctly aligned and updated, is there anything else we should double-check to ensure our model can be listed on the leaderboard?
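If it is useful for double-checking, a small sketch like the one below can list any fields that are still empty in the generated file; the path is an assumption about where the run wrote its output, so point it at the file your run actually produced:

```python
import json
from pathlib import Path

# Placeholder path: mteb nests model_meta.json under the model name and revision
# inside the output folder, so adjust this to your actual folder layout.
meta_path = Path("results/annamodels__LGAI-Embedding-Preview/model_meta.json")

meta = json.loads(meta_path.read_text())

# Report top-level fields that are still null, e.g. a missing reference URL.
missing = [key for key, value in meta.items() if value is None]
print("Fields still null:", missing or "none")
```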
Perfect, thanks for taking the time on this! I have enabled auto-merge on this PR (so it should be on the leaderboard by tomorrow's update).
Checklist
- The model implementation has been added to mteb/models/ (this can be as an API). Instructions on how to add a model can be found here.
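As a rough illustration of what that registration involves, the sketch below shows the general shape of a ModelMeta entry. The import path, the set of required fields, and all of the values here are assumptions that vary with the installed mteb version, so the real entry should follow the linked instructions rather than this snippet:

```python
from mteb.model_meta import ModelMeta  # import path may differ between mteb versions

# Illustrative placeholder values only; newer mteb versions require additional
# fields (e.g. parameter counts, license, training-data details) to be filled in.
lgai_embedding_preview = ModelMeta(
    name="annamodels/LGAI-Embedding-Preview",
    revision="main",                 # placeholder; use the actual model revision
    release_date="2025-01-01",       # placeholder date
    languages=["eng-Latn"],          # assumed language-code format
    open_weights=True,               # placeholder; set according to the model card
    reference="https://huggingface.co/annamodels/LGAI-Embedding-Preview",  # assumed model card URL
    framework=["Sentence Transformers", "PyTorch"],  # placeholder
)
```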