
@annamodels
Contributor

Submit results

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/ (this can be an API-based implementation). Instructions on how to add a model can be found here
  • The results submitted were obtained using the reference implementation
  • My model is publicly available, either as a publicly accessible API or e.g. on Hugging Face
  • I solemnly swear that for all results submitted I have not trained on the evaluation datasets, including their training splits. If I have, I have disclosed it clearly.

Contributor

@KennethEnevoldsen KennethEnevoldsen left a comment


Seems like these weren't run with the submitted implementation? (at least the metadata does not match)

Results for annamodels/LGAI-Embedding-Preview

| task_name | annamodels/LGAI-Embedding-Preview | google/gemini-embedding-001 | intfloat/multilingual-e5-large |
|---|---|---|---|
| AmazonCounterfactualClassification | 0.93 | 0.88 | 0.7 |
| ArXivHierarchicalClusteringP2P | 0.66 | 0.65 | 0.56 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 | 0.54 |
| ArguAna | 0.87 | 0.86 | 0.54 |
| AskUbuntuDupQuestions | 0.66 | 0.64 | 0.59 |
| BIOSSES | 0.86 | 0.89 | 0.85 |
| Banking77Classification | 0.91 | 0.94 | 0.75 |
| BiorxivClusteringP2P.v2 | 0.54 | 0.54 | 0.37 |
| CQADupstackGamingRetrieval | 0.70 | 0.71 | 0.59 |
| CQADupstackUnixRetrieval | 0.56 | 0.54 | 0.4 |
| ClimateFEVERHardNegatives | 0.42 | 0.31 | 0.26 |
| FEVERHardNegatives | 0.93 | 0.89 | 0.84 |
| FiQA2018 | 0.61 | 0.62 | 0.44 |
| HotpotQAHardNegatives | 0.76 | 0.87 | 0.71 |
| ImdbClassification | 0.97 | 0.95 | 0.89 |
| MTOPDomainClassification | 0.98 | 0.98 | 0.9 |
| MassiveIntentClassification | 0.82 | 0.82 | 0.6 |
| MassiveScenarioClassification | 0.85 | 0.87 | 0.7 |
| MedrxivClusteringP2P.v2 | 0.47 | 0.47 | 0.34 |
| MedrxivClusteringS2S.v2 | 0.48 | 0.45 | 0.32 |
| MindSmallReranking | 0.33 | 0.33 | 0.3 |
| SCIDOCS | 0.27 | 0.25 | 0.17 |
| SICK-R | 0.85 | 0.83 | 0.8 |
| STS12 | 0.82 | 0.82 | 0.8 |
| STS13 | 0.90 | 0.90 | 0.82 |
| STS14 | 0.88 | 0.85 | 0.78 |
| STS15 | 0.92 | 0.90 | 0.89 |
| STS17 | 0.90 | 0.89 | 0.82 |
| STS22.v2 | 0.75 | 0.72 | 0.64 |
| STSBenchmark | 0.91 | 0.89 | 0.87 |
| SprintDuplicateQuestions | 0.97 | 0.97 | 0.93 |
| StackExchangeClustering.v2 | 0.79 | 0.92 | 0.46 |
| StackExchangeClusteringP2P.v2 | 0.49 | 0.51 | 0.39 |
| SummEvalSummarization.v2 | 0.39 | 0.38 | 0.31 |
| TRECCOVID | 0.90 | 0.86 | 0.71 |
| Touche2020Retrieval.v3 | 0.59 | 0.52 | 0.5 |
| ToxicConversationsClassification | 0.93 | 0.89 | 0.66 |
| TweetSentimentExtractionClassification | 0.80 | 0.70 | 0.63 |
| TwentyNewsgroupsClustering.v2 | 0.68 | 0.57 | 0.39 |
| TwitterSemEval2015 | 0.80 | 0.79 | 0.75 |
| TwitterURLCorpus | 0.88 | 0.87 | 0.86 |
| Average | 0.74 | 0.73 | 0.62 |

| task_name | ByteDance-Seed/Seed1.5-Embedding | annamodels/LGAI-Embedding-Preview |
|---|---|---|
| AmazonCounterfactualClassification | 0.91 | 0.93 |
| ArXivHierarchicalClusteringP2P | 0.65 | 0.66 |
| ArXivHierarchicalClusteringS2S | 0.64 | 0.64 |
| ArguAna | 0.75 | 0.87 |
| AskUbuntuDupQuestions | 0.69 | 0.66 |
| BIOSSES | 0.83 | 0.86 |
| Banking77Classification | 0.92 | 0.91 |
| BiorxivClusteringP2P.v2 | 0.56 | 0.54 |
| CQADupstackGamingRetrieval | 0.72 | 0.70 |
| CQADupstackUnixRetrieval | 0.57 | 0.56 |
| ClimateFEVERHardNegatives | 0.48 | 0.42 |
| FEVERHardNegatives | 0.95 | 0.93 |
| FiQA2018 | 0.65 | 0.61 |
| HotpotQAHardNegatives | 0.86 | 0.76 |
| ImdbClassification | 0.97 | 0.97 |
| MTOPDomainClassification | 0.99 | 0.98 |
| MassiveIntentClassification | 0.87 | 0.82 |
| MassiveScenarioClassification | 0.93 | 0.85 |
| MedrxivClusteringP2P.v2 | 0.52 | 0.47 |
| MedrxivClusteringS2S.v2 | 0.51 | 0.48 |
| MindSmallReranking | 0.33 | 0.33 |
| SCIDOCS | 0.26 | 0.27 |
| SICK-R | 0.84 | 0.85 |
| STS12 | 0.85 | 0.82 |
| STS13 | 0.93 | 0.90 |
| STS14 | 0.90 | 0.88 |
| STS15 | 0.92 | 0.92 |
| STS17 | 0.93 | 0.90 |
| STS22.v2 | 0.73 | 0.75 |
| STSBenchmark | 0.92 | 0.91 |
| SprintDuplicateQuestions | 0.97 | 0.97 |
| StackExchangeClustering.v2 | 0.81 | 0.79 |
| StackExchangeClusteringP2P.v2 | 0.53 | 0.49 |
| SummEvalSummarization.v2 | 0.36 | 0.39 |
| TRECCOVID | 0.88 | 0.90 |
| Touche2020Retrieval.v3 | 0.64 | 0.59 |
| ToxicConversationsClassification | 0.87 | 0.93 |
| TweetSentimentExtractionClassification | 0.72 | 0.80 |
| TwentyNewsgroupsClustering.v2 | 0.65 | 0.68 |
| TwitterSemEval2015 | 0.78 | 0.80 |
| TwitterURLCorpus | 0.87 | 0.88 |
| Average | 0.75 | 0.74 |

All of the suspiciously high scores come from tasks that are not zero-shot for this model.

@annamodels
Contributor Author

annamodels commented Jun 12, 2025

@KennethEnevoldsen Thanks for your review.
When implementing our model, we applied the following methodologies, which are described in detail in our technical report (https://arxiv.org/pdf/2506.07438). (Note that the original model name was LG-ANNA-Embedding, but it was recently changed to LGAI-Embedding-Preview, so the report title will also be updated accordingly.)

  • Knowledge distillation through soft labeling (Section 4.1)
  • Instruction tuning using in-task examples (query + positive) during training (Section 4.2)
  • Few-shot examples added during inference as well (as noted on the HuggingFace model page)
  • Sophisticated hard-negative mining techniques (Section 4.3)
  • Converting NLI datasets into STS-style format (Section 3 - Data Conversion)

If there's anything we need to adjust to have our model listed on the leaderboard, could you kindly provide a clear guide? We’d be happy to revise accordingly based on your instructions.

@KennethEnevoldsen
Contributor

Sorry, this wasn't clear. It seems like these weren't obtained using the submitted implementation. At least the metadata file does not match. Simply rerunning it with the submitted implementation should solve this.

@annamodels
Contributor Author

annamodels commented Jun 12, 2025

@KennethEnevoldsen Thanks for your response.
First, I’d like to clarify that the scores we submitted were obtained using inference with our model.
Could you please explain in more detail what you mean by “the submitted implementation”? It would be helpful to better understand what exactly is expected.
Also, could you elaborate on the part where you mentioned that "at least the metadata file does not match"? A bit more context would be appreciated.
If there are any specific references or parts we should follow, we’d really appreciate it if you could guide us.

@KennethEnevoldsen
Contributor

First, I’d like to clarify that the scores we submitted were obtained using inference with our model.

Thanks for the confirmation.

The problem is that the submitted file model_meta.json does not align with the ModelMeta (e.g. reference is None).

It might be that you hadn't added all the metadata beforehand. If that is the case, you can just recreate the model_meta.json by deleting it and running a task.
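
For reference, a minimal sketch of this check; the model name is taken from this PR, and the path to the submitted model_meta.json is illustrative and depends on the layout of the results folder:

import json
import mteb

# Metadata as defined by the reference implementation in mteb/models/
meta = mteb.get_model_meta("annamodels/LGAI-Embedding-Preview")
print(meta.reference)  # should not be None once the model is registered

# Metadata as submitted in this PR (illustrative path)
with open("results/annamodels__LGAI-Embedding-Preview/<revision>/model_meta.json") as f:
    submitted = json.load(f)
print(submitted.get("reference"))  # should match the value above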

@annamodels
Contributor Author

@KennethEnevoldsen Thanks so much for your kind response — I really appreciate it.

During inference, we simply evaluated the model locally without specifying any metadata, which is likely why model_meta.json ended up with missing fields like reference.

Even when we load our model from Hugging Face for evaluation, the reference field in model_meta.json still appears as null.

Would it be possible for you to kindly guide us on the correct way to input or configure the model metadata? We’d be very grateful for any instructions or examples you could share.

@KennethEnevoldsen
Contributor

You should be able to do it using:

import mteb

meta = mteb.get_model_meta(name)  # name of the model as registered in mteb/models/
model = meta.load_model()  # load the model using the specified implementation

# evaluate with mteb
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, ...)
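
As a concrete example, a run that regenerates model_meta.json from the registered implementation might look like the following (a sketch assuming the model is registered under the name used in this PR; the task choice and output folder are illustrative):

import mteb

meta = mteb.get_model_meta("annamodels/LGAI-Embedding-Preview")
model = meta.load_model()  # load the model via the reference implementation in mteb/models/

tasks = mteb.get_tasks(tasks=["STSBenchmark"])  # any single task is enough to regenerate model_meta.json
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")  # writes task results and model_meta.json under the output folder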

@annamodels
Contributor Author

annamodels commented Jun 13, 2025

Following your instructions, I’ve found that the model_meta.json file now correctly aligns with the ModelMeta information.

Just to clarify, the reason the initial model_meta.json we submitted had None values for some fields is that we hadn't yet added the model to the mteb repository on GitHub. At that time, we were running inference using our own evaluation script on a local directory where the model was stored, which is why some metadata fields were left as None.

Now that the metadata is correctly aligned and updated, is there anything else we should double-check to ensure our model can be listed on the leaderboard?

@KennethEnevoldsen
Contributor

Perfect, thanks for taking the time on this! I have enabled auto-merge on this PR (so it should be on the leaderboard by tomorrow's update).

@KennethEnevoldsen KennethEnevoldsen enabled auto-merge (squash) June 15, 2025 16:49
@KennethEnevoldsen KennethEnevoldsen merged commit 137ba81 into embeddings-benchmark:main Jun 15, 2025
2 checks passed