add: EmbeddingGemma 300M #269
Model Results Comparison
Reference models: google/gemini-embedding-001, intfloat/multilingual-e5-large. Results for each task:
| task_name | google/embeddinggemma-300m | google/gemini-embedding-001 | intfloat/multilingual-e5-large | Max result |
|---|---|---|---|---|
| AILAStatutes | 0.3737 | 0.4877 | 0.2084 | 0.8509 |
| AfriSentiClassification | 0.4447 | 0.5356 | 0.455 | 0.5399 |
| AlloProfClusteringS2S.v2 | 0.5282 | 0.5636 | 0.3328 | 0.5965 |
| AlloprofReranking | 0.7969 | 0.8177 | 0.6944 | 0.8513 |
| AmazonCounterfactualClassification | 0.8423 | 0.8820 | 0.6965 | 0.9696 |
| AppsRetrieval | 0.8439 | 0.9375 | 0.3255 | 0.9375 |
| ArXivHierarchicalClusteringP2P | 0.6359 | 0.6492 | 0.5569 | 0.6869 |
| ArXivHierarchicalClusteringS2S | 0.5959 | 0.6384 | 0.5367 | 0.6548 |
| ArguAna | 0.7154 | 0.8644 | 0.5436 | 0.8979 |
| ArmenianParaphrasePC | 0.9268 | 0.9689 | 0.9493 | 0.9689 |
| AskUbuntuDupQuestions | 0.6295 | 0.6424 | 0.5924 | 0.7020 |
| BIOSSES | 0.8638 | 0.8897 | 0.8457 | 0.9692 |
| BUCC.v2 | 0.9875 | 0.9899 | 0.9878 | 0.9902 |
| Banking77Classification | 0.9145 | 0.9427 | 0.7492 | 0.9427 |
| BelebeleRetrieval | 0.7238 | 0.9073 | 0.7791 | 0.9167 |
| BibleNLPBitextMining | 0.1268 | 0.2072 | 0.1665 | 0.9899 |
| BigPatentClustering.v2 | 0.4164 | 0.3806 | 0.3147 | 0.4553 |
| BiorxivClusteringP2P.v2 | 0.5210 | 0.5386 | 0.372 | 0.5642 |
| BornholmBitextMining | 0.3463 | 0.5169 | 0.4416 | 0.7633 |
| BrazilianToxicTweetsClassification | 0.2234 | 0.2802 | 0.2123 | 0.2802 |
| BulgarianStoreReviewSentimentClassfication | 0.7126 | 0.7813 | 0.6385 | 0.8044 |
| CEDRClassification | 0.5276 | 0.5742 | 0.4484 | 0.7301 |
| CLSClusteringP2P.v2 | 0.4145 | 0.4268 | 0.4037 | 0.7572 |
| COIRCodeSearchNetRetrieval | 0.7554 | 0.8106 | nan | 0.8951 |
| CQADupstackGamingRetrieval | 0.5952 | 0.7068 | 0.587 | 0.7861 |
| CQADupstackUnixRetrieval | 0.4152 | 0.5369 | 0.3988 | 0.7198 |
| CSFDSKMovieReviewSentimentClassification | 0.3449 | 0.4938 | 0.3484 | 0.6243 |
| CTKFactsNLI | 0.7934 | 0.8759 | 0.7984 | 0.8993 |
| CataloniaTweetClassification | 0.5123 | 0.5451 | 0.504 | 0.5563 |
| ClimateFEVERHardNegatives | 0.2671 | 0.3106 | 0.26 | 0.4900 |
| CodeEditSearchRetrieval | 0.6210 | 0.8161 | 0.5038 | 0.8161 |
| CodeFeedbackMT | 0.5142 | 0.5628 | 0.4278 | 0.9370 |
| CodeFeedbackST | 0.8026 | 0.8533 | 0.7426 | 0.9067 |
| CodeSearchNetCCRetrieval | 0.7371 | 0.8469 | 0.7783 | 0.9635 |
| CodeSearchNetRetrieval | 0.9015 | 0.9133 | 0.8412 | 0.9397 |
| CodeTransOceanContest | 0.8551 | 0.8953 | 0.7403 | 0.9496 |
| CodeTransOceanDL | 0.3352 | 0.3147 | 0.3128 | 0.4419 |
| Core17InstructionRetrieval | 0.0631 | 0.0769 | -0.0162 | 0.1648 |
| CosQA | 0.4360 | 0.5024 | 0.348 | 0.5218 |
| CovidRetrieval | 0.7893 | 0.7913 | 0.7561 | 0.9606 |
| CyrillicTurkicLangClassification | 0.5863 | 0.9530 | 0.4085 | 0.9615 |
| CzechProductReviewSentimentClassification | 0.5863 | 0.6816 | 0.5714 | 0.6988 |
| DBpediaClassification | 0.9427 | 0.9476 | 0.8828 | 0.9926 |
| DalajClassification | 0.5028 | 0.5047 | 0.5001 | 0.5352 |
| DiaBlaBitextMining | 0.8393 | 0.8723 | 0.8483 | 0.8846 |
| EstonianValenceClassification | 0.3829 | 0.5352 | 0.4289 | 0.6820 |
| FEVERHardNegatives | 0.8075 | 0.8898 | 0.8379 | 0.9453 |
| FaroeseSTS | 0.6530 | 0.8612 | 0.7239 | 0.9739 |
| FiQA2018 | 0.4774 | 0.6178 | 0.4381 | 0.7991 |
| FilipinoShopeeReviewsClassification | 0.4052 | 0.4845 | 0.3527 | 0.5052 |
| FinParaSTS | 0.2522 | 0.2860 | 0.2492 | 0.3399 |
| FinancialPhrasebankClassification | 0.8645 | 0.8864 | 0.8394 | 0.9515 |
| FloresBitextMining | 0.5535 | 0.8371 | 0.8108 | 0.8596 |
| GermanSTSBenchmark | 0.8467 | 0.8809 | 0.8408 | 0.9541 |
| GreekLegalCodeClassification | 0.2903 | 0.4376 | 0.3713 | 0.5648 |
| GujaratiNewsClassification | 0.8278 | 0.9205 | 0.7674 | 0.9205 |
| HALClusteringS2S.v2 | 0.2934 | 0.3200 | 0.2261 | 0.3237 |
| HagridRetrieval | 0.9892 | 0.9931 | 0.9891 | 0.9931 |
| HotpotQAHardNegatives | 0.7148 | 0.8701 | 0.7055 | 0.8701 |
| IN22GenBitextMining | 0.7440 | 0.9375 | 0.7675 | 0.9375 |
| ImdbClassification | 0.9292 | 0.9498 | 0.8867 | 0.9737 |
| IndicCrosslingualSTS | 0.4307 | 0.6287 | 0.4387 | 0.8477 |
| IndicGenBenchFloresBitextMining | 0.8709 | 0.9677 | 0.8875 | 0.9881 |
| IndicLangClassification | 0.4662 | 0.8769 | 0.2025 | 0.9532 |
| IndonesianIdClickbaitClassification | 0.6087 | 0.6700 | 0.6122 | 0.6700 |
| IsiZuluNewsClassification | 0.2642 | 0.4053 | 0.3241 | 0.4053 |
| ItaCaseholdClassification | 0.7036 | 0.7330 | 0.6679 | 0.9439 |
| JSICK | 0.8440 | 0.8499 | 0.7983 | 0.8938 |
| KorHateSpeechMLClassification | 0.1158 | 0.1769 | 0.1049 | 0.2167 |
| KorSarcasmClassification | 0.5810 | 0.6051 | 0.5679 | 0.6629 |
| KurdishSentimentClassification | 0.5997 | 0.8639 | 0.7708 | 0.8639 |
| LEMBPasskeyRetrieval | 0.6075 | 0.3850 | 0.3825 | 1.0000 |
| LegalBenchCorporateLobbying | 0.9508 | 0.9598 | 0.8972 | 0.9696 |
| MIRACLRetrievalHardNegatives | 0.6620 | 0.7042 | 0.6675 | 0.7058 |
| MLQARetrieval | 0.7895 | 0.8416 | 0.7566 | 0.8416 |
| MTOPDomainClassification | 0.9636 | 0.9796 | 0.9024 | 0.9995 |
| MacedonianTweetSentimentClassification | 0.4531 | 0.7183 | 0.6192 | 0.7547 |
| MalteseNewsClassification | 0.3308 | 0.3738 | 0.2395 | 0.4741 |
| MasakhaNEWSClassification | 0.7493 | 0.8355 | 0.7754 | 0.8603 |
| MasakhaNEWSClusteringS2S | 0.4346 | 0.5745 | 0.3804 | 0.7182 |
| MassiveIntentClassification | 0.6270 | 0.8192 | 0.6025 | 0.9194 |
| MassiveScenarioClassification | 0.7163 | 0.8730 | 0.6996 | 0.9930 |
| MedrxivClusteringP2P.v2 | 0.4411 | 0.4716 | 0.3431 | 0.5179 |
| MedrxivClusteringS2S.v2 | 0.4193 | 0.4501 | 0.3152 | 0.5106 |
| MindSmallReranking | 0.3190 | 0.3295 | 0.3024 | 0.3437 |
| MultiEURLEXMultilabelClassification | 0.0434 | 0.0528 | 0.0516 | 0.0550 |
| MultiHateClassification | 0.6100 | 0.7247 | 0.6357 | 0.8262 |
| NTREXBitextMining | 0.7387 | 0.9364 | 0.914 | 0.9368 |
| NepaliNewsClassification | 0.9548 | 0.9814 | 0.8847 | 0.9814 |
| News21InstructionRetrieval | 0.1145 | 0.1026 | -0.0006 | 0.1026 |
| NollySentiBitextMining | 0.4126 | 0.6871 | 0.675 | 0.8071 |
| NordicLangClassification | 0.6556 | 0.8597 | 0.8015 | 0.9199 |
| NorwegianCourtsBitextMining | 0.9079 | 0.9342 | 0.9404 | 0.9447 |
| NusaParagraphEmotionClassification | 0.4416 | 0.5638 | 0.4166 | 0.6538 |
| NusaTranslationBitextMining | 0.6605 | 0.7752 | 0.672 | 0.9222 |
| NusaX-senti | 0.6980 | 0.8031 | 0.7055 | 0.8093 |
| NusaXBitextMining | 0.6711 | 0.8252 | 0.7267 | 0.8790 |
| OdiaNewsClassification | 0.5795 | 0.9184 | 0.8001 | 0.9490 |
| OpusparcusPC | 0.9335 | 0.9662 | 0.9451 | 0.9662 |
| PAC | 0.6787 | 0.7168 | 0.7033 | 0.7387 |
| PawsXPairClassification | 0.5773 | 0.5999 | 0.5473 | 0.7524 |
| PlscClusteringP2P.v2 | 0.7214 | 0.7431 | 0.7161 | 0.7524 |
| PoemSentimentClassification | 0.5886 | 0.5966 | 0.5067 | 0.7522 |
| PolEmo2.0-OUT | 0.6277 | 0.7753 | 0.3648 | 0.7881 |
| PpcPC | 0.9086 | 0.9550 | 0.9218 | 0.9550 |
| PunjabiNewsClassification | 0.8236 | 0.8261 | 0.807 | 0.8522 |
| RTE3 | 0.8967 | 0.8955 | 0.8752 | 0.9123 |
| Robust04InstructionRetrieval | -0.0094 | -0.0241 | -0.0748 | 0.1372 |
| RomaniBibleClustering | 0.4188 | 0.4322 | 0.4092 | 0.4514 |
| RuBQReranking | 0.7126 | 0.7384 | 0.756 | 0.7724 |
| SCIDOCS | 0.1843 | 0.2515 | 0.1745 | 0.3453 |
| SIB200ClusteringS2S | 0.2653 | 0.4174 | 0.2366 | 0.4719 |
| SICK-R | 0.8137 | 0.8275 | 0.8023 | 0.9465 |
| STS12 | 0.7932 | 0.8155 | 0.8002 | 0.9546 |
| STS13 | 0.8642 | 0.8989 | 0.8155 | 0.9776 |
| STS14 | 0.8367 | 0.8541 | 0.7772 | 0.9753 |
| STS15 | 0.8935 | 0.9044 | 0.8931 | 0.9811 |
| STS17 | 0.8442 | 0.8858 | 0.8214 | 0.9323 |
| STS22.v2 | 0.7120 | 0.7169 | 0.643 | 0.7718 |
| STSB | 0.8164 | 0.8550 | 0.8236 | 0.9199 |
| STSBenchmark | 0.8816 | 0.8908 | 0.8729 | 0.9504 |
| STSES | 0.8231 | 0.8175 | 0.8021 | 0.8175 |
| ScalaClassification | 0.5077 | 0.5185 | 0.5157 | 0.5743 |
| SemRel24STS | 0.6522 | 0.7314 | 0.6266 | 0.8112 |
| SentimentAnalysisHindi | 0.6549 | 0.7606 | 0.642 | 0.8001 |
| SinhalaNewsClassification | 0.6567 | 0.8229 | 0.6682 | 0.8229 |
| SiswatiNewsClassification | 0.5700 | 0.6238 | 0.535 | 0.7837 |
| SlovakMovieReviewSentimentClassification | 0.7327 | 0.9035 | 0.7441 | 0.9441 |
| SpartQA | 0.1068 | 0.1030 | 0.0565 | 0.3024 |
| SprintDuplicateQuestions | 0.9703 | 0.9690 | 0.9314 | 0.9838 |
| StackExchangeClustering.v2 | 0.9094 | 0.9207 | 0.4643 | 0.9207 |
| StackExchangeClusteringP2P.v2 | 0.4890 | 0.5091 | 0.3854 | 0.5510 |
| StackOverflowQA | 0.8647 | 0.9671 | 0.8889 | 0.9717 |
| StatcanDialogueDatasetRetrieval | 0.4627 | 0.5111 | 0.1063 | 0.5807 |
| SummEvalSummarization.v2 | 0.3764 | 0.3828 | 0.3141 | 0.3893 |
| SwahiliNewsClassification | 0.6595 | 0.6605 | 0.5969 | 0.6753 |
| SwednClusteringP2P | 0.4004 | 0.4584 | 0.3691 | 0.6213 |
| SwissJudgementClassification | 0.5774 | 0.5786 | 0.5362 | 0.6727 |
| SyntheticText2SQL | 0.5842 | 0.6996 | 0.5307 | 0.7875 |
| T2Reranking | 0.6754 | 0.6795 | 0.6632 | 0.7283 |
| TERRa | 0.6515 | 0.6392 | 0.5842 | 0.7133 |
| TRECCOVID | 0.8035 | 0.8631 | 0.7115 | 0.9499 |
| Tatoeba | 0.5135 | 0.8197 | 0.7573 | 0.9515 |
| TempReasonL1 | 0.0100 | 0.0296 | 0.0114 | 0.0716 |
| Touche2020Retrieval.v3 | 0.5890 | 0.5239 | 0.4959 | 0.7465 |
| ToxicConversationsClassification | 0.8293 | 0.8875 | 0.6601 | 0.9759 |
| TswanaNewsClassification | 0.3115 | 0.5337 | 0.47 | 0.5337 |
| TweetSentimentExtractionClassification | 0.6659 | 0.6988 | 0.628 | 0.8823 |
| TweetTopicSingleClassification | 0.7302 | 0.7111 | 0.6532 | 0.8171 |
| TwentyNewsgroupsClustering.v2 | 0.5129 | 0.5737 | 0.3921 | 0.8758 |
| TwitterHjerneRetrieval | 0.7204 | 0.9802 | 0.3522 | 0.9802 |
| TwitterSemEval2015 | 0.7794 | 0.7917 | 0.7528 | 0.8946 |
| TwitterURLCorpus | 0.8690 | 0.8705 | 0.8583 | 0.9571 |
| VoyageMMarcoReranking | 0.6099 | 0.6673 | 0.6821 | 0.7126 |
| WebLINXCandidatesReranking | 0.1016 | 0.1097 | 0.0778 | 0.1595 |
| WikiCitiesClustering | 0.9202 | 0.9163 | 0.755 | 0.9381 |
| WikiClusteringP2P.v2 | 0.2703 | 0.2823 | 0.256 | 0.3234 |
| WikipediaRerankingMultilingual | 0.8988 | 0.9224 | 0.8932 | 0.9224 |
| WikipediaRetrievalMultilingual | 0.9000 | 0.9420 | 0.904 | 0.9420 |
| WinoGrande | 0.5940 | 0.6052 | 0.5498 | 0.7561 |
| XNLI | 0.8173 | 0.8526 | 0.7477 | 0.8907 |
| indonli | 0.6095 | 0.6069 | 0.5174 | 0.6683 |
| Average | 0.6169 | 0.6863 | 0.5822 | 0.7615 |
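For anyone who wants to rebuild a side-by-side comparison like the table above locally, recent mteb releases expose a results-loading helper; the sketch below assumes that API and the exact function names may differ between versions:

```python
import mteb

# Load published scores for the three models compared above from the
# public results repository (assumes a recent mteb release).
results = mteb.load_results(
    models=[
        "google/embeddinggemma-300m",
        "google/gemini-embedding-001",
        "intfloat/multilingual-e5-large",
    ]
)

# One row per (model, task) pair; pivot as needed to get a table like the one above.
df = results.to_dataframe()
print(df.head())
```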
---
@Samoed there seems to be something going on with one of the tests, where it thinks the JSON files are directories. Also, do I need to do anything here to support the "gemma" license like we did in embeddings-benchmark/mteb#3129?
Samoed left a comment:
You need to put your results under a revision folder. In your case this should be `results/google__embeddinggemma-300m/64614b0b8b64f0c6c1e52b07e4e9a4e8fe4d2da2/...`
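If it helps, the revision hash can be read off the Hub programmatically; a small sketch using huggingface_hub (the snippet is illustrative — the folder-naming convention itself comes from this results repo, not from the snippet):

```python
from huggingface_hub import HfApi

# Resolve the current commit sha of the model repo on the Hugging Face Hub.
sha = HfApi().model_info("google/embeddinggemma-300m").sha

# Result files are expected under results/<org>__<model>/<revision>/...
print(f"results/google__embeddinggemma-300m/{sha}/")
```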
---
Thanks, @Samoed! Updated the directories and re-running CI.
---
@Samoed now I'm really confused. These really look like the same path to me (and to Chrome's Find on Page function).
---
Me too. I'll try to take a look.
---
@Samoed got it. The difference between "M" and "m" matters — the paths are case-sensitive; see the small illustration below.
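A quick illustration (hypothetical paths, assuming a case-sensitive filesystem like the CI runners use):

```python
from pathlib import Path

# Hypothetical example: on a case-sensitive filesystem these are two
# distinct directories, even though they differ only in one letter's case.
upper = Path("results/google__embeddinggemma-300M")
lower = Path("results/google__embeddinggemma-300m")

print(upper == lower)  # False — only the exact lowercase form matches the HF model id
```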
---
Do you have any idea why you have such big scores on News21InstructionRetrieval and LEMBPasskeyRetrieval?
---
Let me check with the team. |
---
Bringing in @schechterh, who ran the evals for the model. We're investigating these scores now. |
---
Found the eval jobs these two numbers came from and can confirm those are the results we got. If it helps, they were run on four TPUv4 chips, on JAX, with parameters loaded as bfloat16. |
---
FWIW, the News21 results are very strong but within a reasonable range, @Samoed (around or better than the Promptriever numbers). I think it mainly means they added some instructions in training, but of course I know they can neither confirm nor deny that ;) Congrats @schechterh and @RyanMullins on the awesome release!
KennethEnevoldsen left a comment:
Thanks for the review @Samoed, and thanks @RyanMullins and @schechterh for double-checking the scores.
I think we can reasonably merge this, congratulations on the release!
---
Hi @RyanMullins @schechterh, I am running the model for MTEB eval, trying to reproduce the results:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "google/embeddinggemma-300m"

model = mteb.get_model(model_name)
# model = SentenceTransformer(model_name)

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model,
    output_folder=f"results_{model_name}_mteb",
    verbosity=1,
    overwrite_results=True,
)
```

I am getting ndcg_at_10: 0.30593 vs the 0.71535 reported. Can you point out the error/bug? That would be very helpful.
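One thing worth checking when scores diverge this much (a hypothesis, not a confirmed diagnosis): EmbeddingGemma relies on task-specific prompt templates, and retrieval scores can drop sharply when queries and documents are encoded without them. With sentence-transformers >= 5.0, which added encode_query/encode_document, the model's built-in prompts can be applied explicitly; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# encode_query/encode_document apply the prompt templates shipped with the
# model's configuration instead of embedding the raw text alone.
query_emb = model.encode_query("what is the capital of France?")
doc_embs = model.encode_document(["Paris is the capital of France."])

print(model.similarity(query_emb, doc_embs))
```

If the mteb wrapper in your installed version does not inject these prompts for ArguAna, that alone could plausibly account for a gap of this size.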
---
Hm, I get that too.

Adds results for EmbeddingGemma 300M on MTEB Multilingual, English, and Code benchmarks.
Checklist
- Model is implemented in `mteb/models/` (this can be as an API)

cc @KennethEnevoldsen @Samoed