
Add jina, uae, stella models #1319

Merged
KennethEnevoldsen merged 19 commits into embeddings-benchmark:main from Samoed:add_jina_models on Oct 30, 2024
Conversation

@Samoed (Member) commented Oct 24, 2024

Checklist

  • Run tests locally using make test to make sure nothing is broken.
  • Run the formatter using make lint.

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using
    • mteb.get_model(model_name, revision) and
    • mteb.get_model_meta(model_name, revision) (see the sketch after this checklist)
  • I have tested that the implementation works on a representative set of tasks.
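
For reference, a minimal sketch of the two loading checks above, assuming the current mteb API; the model name here is illustrative, not this PR's final metadata:

```python
import mteb

model_name = "jinaai/jina-embeddings-v3"

# Resolve the registered ModelMeta entry (no weights are downloaded).
meta = mteb.get_model_meta(model_name)
print(meta.name, meta.revision)

# Instantiate the model itself via its registered loader.
model = mteb.get_model(model_name, revision=meta.revision)
```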

@Samoed mentioned this pull request on Oct 24, 2024
@bwanglzu

The rest looks good to me. I need to run some checks to make sure the different task adapters (especially retrieval) are passed task and prompt_name correctly and can reproduce our reported results; I'm running some small tests.
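
A hedged sketch of the kind of check being described here — verifying that selecting a different task adapter actually changes the embeddings. The task names follow the jina-embeddings-v3 model card and are assumptions in this sketch, not code from this PR:

```python
from transformers import AutoModel

# jina-embeddings-v3 routes inputs through task-specific LoRA adapters.
model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v3", trust_remote_code=True
)

text = ["What is the weather like today?"]

# Query-side and passage-side adapters should yield different vectors
# for the same input text.
q = model.encode(text, task="retrieval.query")
p = model.encode(text, task="retrieval.passage")
assert (q != p).any(), "adapter selection had no effect on the embedding"
```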

@Samoed (Member, Author) commented Oct 24, 2024

I have the results and will paste them soon (I'm currently creating a table to compare them easily; for jina they are the same). @bwanglzu Thank you very much!

@Samoed (Member, Author) commented Oct 24, 2024

Results Summary:

  1. I was able to reproduce the results for jina-embeddings-v3, except for EmotionClassification. Overall, the results seem consistent.
  2. For UAE-Large-V1, the results are close but differ for ToxicConversationsClassification.
  3. I couldn't reproduce the results for stella and was considering removing it from this PR. I've opened an issue on HF regarding this: link. I thought this was because they use a different dimension, but I wasn't sure. (UPD: reran it as a GritLM model and the results are much better; I forgot that it is an instruct model.)

cc @bwanglzu
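
On the "instruct model" point: stella_en_400M_v5 expects queries to carry an instruction prefix, which plain symmetric encoding misses. A minimal sketch, assuming the prompt names from the stella model card ("s2p_query" for sentence-to-passage queries):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dunzhang/stella_en_400M_v5", trust_remote_code=True)

queries = ["What are some ways to reduce stress?"]
docs = ["There are many effective ways to reduce stress, such as exercise."]

# Queries get the s2p instruction prompt; documents are encoded as-is.
q_emb = model.encode(queries, prompt_name="s2p_query")
d_emb = model.encode(docs)
print(model.similarity(q_emb, d_emb))
```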

Full results

Classification

| model name | AmazonCounterfactualClassification (en) | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|
| jina-embeddings-v3 (leaderboard) | 89.49 | 73.3 | 91.29 |
| jina-embeddings-v3 | 89.34 | 77.26 | 91.25 |
| UAE-Large-V1 (leaderboard) | 75.55 | 51.75 | 71.09 |
| UAE-Large-V1 | 74.77 | 51.75 | 66.93 |
| stella_en_400M_v5 (leaderboard) | 92.36 | 78.77 | 89.94 |
| stella_en_400M_v5 | 91.76 | 81.44 | 88.11 |

Clustering

| model name | ArxivClusteringS2S | RedditClustering |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 39.27 | 55.4 |
| jina-embeddings-v3 | 39.24 | 55.18 |
| UAE-Large-V1 (leaderboard) | 43.09 | 60.52 |
| UAE-Large-V1 | 43.01 | 59.77 |
| stella_en_400M_v5 (leaderboard) | 49.82 | 71.19 |
| stella_en_400M_v5 | 49.67 | 70.67 |

PairClassification

| model name | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 96.99 | 70.9 |
| jina-embeddings-v3 | 96.99 | 70.9 |
| UAE-Large-V1 (leaderboard) | 97.24 | 78.17 |
| UAE-Large-V1 | 97.23 | 78.16 |
| stella_en_400M_v5 (leaderboard) | 95.59 | 80.18 |
| stella_en_400M_v5 | 95.50 | 80.26 |

Reranking

| model name | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 84.88 | 65.04 |
| jina-embeddings-v3 | 84.86 | 65.31 |
| UAE-Large-V1 (leaderboard) | 87.49 | 64.2 |
| UAE-Large-V1 | 87.03 | 63.12 |
| stella_en_400M_v5 (leaderboard) | 88.44 | 66.15 |
| stella_en_400M_v5 | 88.16 | 65.55 |

Retrieval

| model name | SCIDOCS | SciFact |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 19.81 | 72.31 |
| jina-embeddings-v3 | 19.87 | 72.68 |
| UAE-Large-V1 (leaderboard) | 22.98 | 74.07 |
| UAE-Large-V1 | 22.98 | 74.07 |
| stella_en_400M_v5 (leaderboard) | 25.04 | 78.23 |
| stella_en_400M_v5 | 23.96 | 77.96 |

STS

| model name | STS16 | STSBenchmark |
|---|---|---|
| jina-embeddings-v3 (leaderboard) | 86.85 | 89.44 |
| jina-embeddings-v3 | 86.83 | 89.44 |
| UAE-Large-V1 (leaderboard) | 86.61 | 89.06 |
| UAE-Large-V1 | 86.61 | 89.06 |
| stella_en_400M_v5 (leaderboard) | 87.14 | 87.74 |
| stella_en_400M_v5 | 87.00 | 87.56 |

Summarization

| model name | SummEval |
|---|---|
| jina-embeddings-v3 (leaderboard) | 29.71 |
| jina-embeddings-v3 | 29.71 |
| UAE-Large-V1 (leaderboard) | 32.03 |
| UAE-Large-V1 | 31.60 |
| stella_en_400M_v5 (leaderboard) | 31.66 |
| stella_en_400M_v5 | 30.59 |

@bwanglzu commented Oct 24, 2024

Perfect, thanks @Samoed! It seems our reported score on Emotion is lower than what we actually have (lol).

Do you mind sharing your script so that I can run a few more experiments?

BTW, some of our reported scores might come from a smaller context length such as 512. I don't recall which datasets we evaluated at a 512 context length, but I believe it was most of the MTEB tasks except LongEmbed.

@Samoed (Member, Author) commented Oct 24, 2024

Here is my code
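
(The script itself isn't reproduced here; as a stand-in, a minimal sketch of how such a comparison run looks with the mteb API — the task selection and output folder are illustrative, not the author's actual choices:)

```python
import mteb

# A few of the tasks compared in the tables above; selection is illustrative.
tasks = mteb.get_tasks(tasks=["EmotionClassification", "SciFact", "STS16"])
evaluation = mteb.MTEB(tasks=tasks)

model = mteb.get_model("jinaai/jina-embeddings-v3")
results = evaluation.run(model, output_folder="results/jina-embeddings-v3")
```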

@Samoed (Member, Author) commented Oct 24, 2024

> BTW, some of our reported scores might come from a smaller context length such as 512

What do you mean? I think that mteb, when using SentenceTransformer, uses the full context length.

@Samoed marked this pull request as ready for review on October 24, 2024, 17:57
@bwanglzu

> What do you mean? I think that mteb, when using SentenceTransformer, uses the full context length.

I mean that when we submit scores, there is a small chance they were submitted by different people on the team using slightly different max sequence lengths. Sometimes, to speed up evaluation, we use 512; sometimes we use the full context length, which is 8192.
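
To illustrate the point: capping the context length is a one-line change on a SentenceTransformer model, which is presumably how such runs differed. The model name and the cap of 512 are taken from the discussion:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
print(model.max_seq_length)  # full context length, e.g. 8192

# Truncate inputs to 512 tokens to speed up evaluation; scores on tasks
# with long documents can shift as a result.
model.max_seq_length = 512
```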

Samoed and others added 2 commits on October 25, 2024, 22:47
@KennethEnevoldsen (Contributor) left a comment

Only a minor thing, otherwise all good.

@Samoed (Member, Author) commented Oct 28, 2024

@KennethEnevoldsen Is this PR ready for merge?

@bwanglzu

I tested a few more benchmarks and the results are consistent. Thanks @Samoed!

@KennethEnevoldsen (Contributor) left a comment

This looks good! Very happy to have it merged in

@KennethEnevoldsen merged commit 0b846ff into embeddings-benchmark:main on Oct 30, 2024
@Samoed deleted the add_jina_models branch on October 20, 2025, 15:47