add: EmbeddingGemma 300M #269

Merged
KennethEnevoldsen merged 3 commits into embeddings-benchmark:main from RyanMullins:embeddinggemma
Sep 4, 2025

Conversation

@RyanMullins
Contributor

Adds results for EmbeddingGemma 300M on MTEB Multilingual, English, and Code benchmarks.

Checklist

  • My model has a model sheet, report or similar
  • My model has a reference implementation in mteb/models/; this can be an API-based implementation.
  • The results submitted were obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on, e.g., Hugging Face Hub
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset, including training splits. If I have, I have disclosed it clearly.

cc @KennethEnevoldsen @Samoed

@github-actions

github-actions bot commented Sep 4, 2025

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: google/embeddinggemma-300m
Tasks: AILAStatutes, AfriSentiClassification, AlloProfClusteringS2S.v2, AlloprofReranking, AmazonCounterfactualClassification, AppsRetrieval, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArmenianParaphrasePC, AskUbuntuDupQuestions, BIOSSES, BUCC.v2, Banking77Classification, BelebeleRetrieval, BibleNLPBitextMining, BigPatentClustering.v2, BiorxivClusteringP2P.v2, BornholmBitextMining, BrazilianToxicTweetsClassification, BulgarianStoreReviewSentimentClassfication, CEDRClassification, CLSClusteringP2P.v2, COIRCodeSearchNetRetrieval, CQADupstackGamingRetrieval, CQADupstackUnixRetrieval, CSFDSKMovieReviewSentimentClassification, CTKFactsNLI, CataloniaTweetClassification, ClimateFEVERHardNegatives, CodeEditSearchRetrieval, CodeFeedbackMT, CodeFeedbackST, CodeSearchNetCCRetrieval, CodeSearchNetRetrieval, CodeTransOceanContest, CodeTransOceanDL, Core17InstructionRetrieval, CosQA, CovidRetrieval, CyrillicTurkicLangClassification, CzechProductReviewSentimentClassification, DBpediaClassification, DalajClassification, DiaBlaBitextMining, EstonianValenceClassification, FEVERHardNegatives, FaroeseSTS, FiQA2018, FilipinoShopeeReviewsClassification, FinParaSTS, FinancialPhrasebankClassification, FloresBitextMining, GermanSTSBenchmark, GreekLegalCodeClassification, GujaratiNewsClassification, HALClusteringS2S.v2, HagridRetrieval, HotpotQAHardNegatives, IN22GenBitextMining, ImdbClassification, IndicCrosslingualSTS, IndicGenBenchFloresBitextMining, IndicLangClassification, IndonesianIdClickbaitClassification, IsiZuluNewsClassification, ItaCaseholdClassification, JSICK, KorHateSpeechMLClassification, KorSarcasmClassification, KurdishSentimentClassification, LEMBPasskeyRetrieval, LegalBenchCorporateLobbying, MIRACLRetrievalHardNegatives, MLQARetrieval, MTOPDomainClassification, MacedonianTweetSentimentClassification, MalteseNewsClassification, MasakhaNEWSClassification, MasakhaNEWSClusteringS2S, MassiveIntentClassification, MassiveScenarioClassification, 
MedrxivClusteringP2P.v2, MedrxivClusteringS2S.v2, MindSmallReranking, MultiEURLEXMultilabelClassification, MultiHateClassification, NTREXBitextMining, NepaliNewsClassification, News21InstructionRetrieval, NollySentiBitextMining, NordicLangClassification, NorwegianCourtsBitextMining, NusaParagraphEmotionClassification, NusaTranslationBitextMining, NusaX-senti, NusaXBitextMining, OdiaNewsClassification, OpusparcusPC, PAC, PawsXPairClassification, PlscClusteringP2P.v2, PoemSentimentClassification, PolEmo2.0-OUT, PpcPC, PunjabiNewsClassification, RTE3, Robust04InstructionRetrieval, RomaniBibleClustering, RuBQReranking, SCIDOCS, SIB200ClusteringS2S, SICK-R, STS12, STS13, STS14, STS15, STS17, STS22.v2, STSB, STSBenchmark, STSES, ScalaClassification, SemRel24STS, SentimentAnalysisHindi, SinhalaNewsClassification, SiswatiNewsClassification, SlovakMovieReviewSentimentClassification, SpartQA, SprintDuplicateQuestions, StackExchangeClustering.v2, StackExchangeClusteringP2P.v2, StackOverflowQA, StatcanDialogueDatasetRetrieval, SummEvalSummarization.v2, SwahiliNewsClassification, SwednClusteringP2P, SwissJudgementClassification, SyntheticText2SQL, T2Reranking, TERRa, TRECCOVID, Tatoeba, TempReasonL1, Touche2020Retrieval.v3, ToxicConversationsClassification, TswanaNewsClassification, TweetSentimentExtractionClassification, TweetTopicSingleClassification, TwentyNewsgroupsClustering.v2, TwitterHjerneRetrieval, TwitterSemEval2015, TwitterURLCorpus, VoyageMMarcoReranking, WebLINXCandidatesReranking, WikiCitiesClustering, WikiClusteringP2P.v2, WikipediaRerankingMultilingual, WikipediaRetrievalMultilingual, WinoGrande, XNLI, indonli

Results for google/embeddinggemma-300m

task_name google/embeddinggemma-300m google/gemini-embedding-001 intfloat/multilingual-e5-large Max result
AILAStatutes 0.3737 0.4877 0.2084 0.8509
AfriSentiClassification 0.4447 0.5356 0.455 0.5399
AlloProfClusteringS2S.v2 0.5282 0.5636 0.3328 0.5965
AlloprofReranking 0.7969 0.8177 0.6944 0.8513
AmazonCounterfactualClassification 0.8423 0.8820 0.6965 0.9696
AppsRetrieval 0.8439 0.9375 0.3255 0.9375
ArXivHierarchicalClusteringP2P 0.6359 0.6492 0.5569 0.6869
ArXivHierarchicalClusteringS2S 0.5959 0.6384 0.5367 0.6548
ArguAna 0.7154 0.8644 0.5436 0.8979
ArmenianParaphrasePC 0.9268 0.9689 0.9493 0.9689
AskUbuntuDupQuestions 0.6295 0.6424 0.5924 0.7020
BIOSSES 0.8638 0.8897 0.8457 0.9692
BUCC.v2 0.9875 0.9899 0.9878 0.9902
Banking77Classification 0.9145 0.9427 0.7492 0.9427
BelebeleRetrieval 0.7238 0.9073 0.7791 0.9167
BibleNLPBitextMining 0.1268 0.2072 0.1665 0.9899
BigPatentClustering.v2 0.4164 0.3806 0.3147 0.4553
BiorxivClusteringP2P.v2 0.5210 0.5386 0.372 0.5642
BornholmBitextMining 0.3463 0.5169 0.4416 0.7633
BrazilianToxicTweetsClassification 0.2234 0.2802 0.2123 0.2802
BulgarianStoreReviewSentimentClassfication 0.7126 0.7813 0.6385 0.8044
CEDRClassification 0.5276 0.5742 0.4484 0.7301
CLSClusteringP2P.v2 0.4145 0.4268 0.4037 0.7572
COIRCodeSearchNetRetrieval 0.7554 0.8106 nan 0.8951
CQADupstackGamingRetrieval 0.5952 0.7068 0.587 0.7861
CQADupstackUnixRetrieval 0.4152 0.5369 0.3988 0.7198
CSFDSKMovieReviewSentimentClassification 0.3449 0.4938 0.3484 0.6243
CTKFactsNLI 0.7934 0.8759 0.7984 0.8993
CataloniaTweetClassification 0.5123 0.5451 0.504 0.5563
ClimateFEVERHardNegatives 0.2671 0.3106 0.26 0.4900
CodeEditSearchRetrieval 0.6210 0.8161 0.5038 0.8161
CodeFeedbackMT 0.5142 0.5628 0.4278 0.9370
CodeFeedbackST 0.8026 0.8533 0.7426 0.9067
CodeSearchNetCCRetrieval 0.7371 0.8469 0.7783 0.9635
CodeSearchNetRetrieval 0.9015 0.9133 0.8412 0.9397
CodeTransOceanContest 0.8551 0.8953 0.7403 0.9496
CodeTransOceanDL 0.3352 0.3147 0.3128 0.4419
Core17InstructionRetrieval 0.0631 0.0769 -0.0162 0.1648
CosQA 0.4360 0.5024 0.348 0.5218
CovidRetrieval 0.7893 0.7913 0.7561 0.9606
CyrillicTurkicLangClassification 0.5863 0.9530 0.4085 0.9615
CzechProductReviewSentimentClassification 0.5863 0.6816 0.5714 0.6988
DBpediaClassification 0.9427 0.9476 0.8828 0.9926
DalajClassification 0.5028 0.5047 0.5001 0.5352
DiaBlaBitextMining 0.8393 0.8723 0.8483 0.8846
EstonianValenceClassification 0.3829 0.5352 0.4289 0.6820
FEVERHardNegatives 0.8075 0.8898 0.8379 0.9453
FaroeseSTS 0.6530 0.8612 0.7239 0.9739
FiQA2018 0.4774 0.6178 0.4381 0.7991
FilipinoShopeeReviewsClassification 0.4052 0.4845 0.3527 0.5052
FinParaSTS 0.2522 0.2860 0.2492 0.3399
FinancialPhrasebankClassification 0.8645 0.8864 0.8394 0.9515
FloresBitextMining 0.5535 0.8371 0.8108 0.8596
GermanSTSBenchmark 0.8467 0.8809 0.8408 0.9541
GreekLegalCodeClassification 0.2903 0.4376 0.3713 0.5648
GujaratiNewsClassification 0.8278 0.9205 0.7674 0.9205
HALClusteringS2S.v2 0.2934 0.3200 0.2261 0.3237
HagridRetrieval 0.9892 0.9931 0.9891 0.9931
HotpotQAHardNegatives 0.7148 0.8701 0.7055 0.8701
IN22GenBitextMining 0.7440 0.9375 0.7675 0.9375
ImdbClassification 0.9292 0.9498 0.8867 0.9737
IndicCrosslingualSTS 0.4307 0.6287 0.4387 0.8477
IndicGenBenchFloresBitextMining 0.8709 0.9677 0.8875 0.9881
IndicLangClassification 0.4662 0.8769 0.2025 0.9532
IndonesianIdClickbaitClassification 0.6087 0.6700 0.6122 0.6700
IsiZuluNewsClassification 0.2642 0.4053 0.3241 0.4053
ItaCaseholdClassification 0.7036 0.7330 0.6679 0.9439
JSICK 0.8440 0.8499 0.7983 0.8938
KorHateSpeechMLClassification 0.1158 0.1769 0.1049 0.2167
KorSarcasmClassification 0.5810 0.6051 0.5679 0.6629
KurdishSentimentClassification 0.5997 0.8639 0.7708 0.8639
LEMBPasskeyRetrieval 0.6075 0.3850 0.3825 1.0000
LegalBenchCorporateLobbying 0.9508 0.9598 0.8972 0.9696
MIRACLRetrievalHardNegatives 0.6620 0.7042 0.6675 0.7058
MLQARetrieval 0.7895 0.8416 0.7566 0.8416
MTOPDomainClassification 0.9636 0.9796 0.9024 0.9995
MacedonianTweetSentimentClassification 0.4531 0.7183 0.6192 0.7547
MalteseNewsClassification 0.3308 0.3738 0.2395 0.4741
MasakhaNEWSClassification 0.7493 0.8355 0.7754 0.8603
MasakhaNEWSClusteringS2S 0.4346 0.5745 0.3804 0.7182
MassiveIntentClassification 0.6270 0.8192 0.6025 0.9194
MassiveScenarioClassification 0.7163 0.8730 0.6996 0.9930
MedrxivClusteringP2P.v2 0.4411 0.4716 0.3431 0.5179
MedrxivClusteringS2S.v2 0.4193 0.4501 0.3152 0.5106
MindSmallReranking 0.3190 0.3295 0.3024 0.3437
MultiEURLEXMultilabelClassification 0.0434 0.0528 0.0516 0.0550
MultiHateClassification 0.6100 0.7247 0.6357 0.8262
NTREXBitextMining 0.7387 0.9364 0.914 0.9368
NepaliNewsClassification 0.9548 0.9814 0.8847 0.9814
News21InstructionRetrieval 0.1145 0.1026 -0.0006 0.1026
NollySentiBitextMining 0.4126 0.6871 0.675 0.8071
NordicLangClassification 0.6556 0.8597 0.8015 0.9199
NorwegianCourtsBitextMining 0.9079 0.9342 0.9404 0.9447
NusaParagraphEmotionClassification 0.4416 0.5638 0.4166 0.6538
NusaTranslationBitextMining 0.6605 0.7752 0.672 0.9222
NusaX-senti 0.6980 0.8031 0.7055 0.8093
NusaXBitextMining 0.6711 0.8252 0.7267 0.8790
OdiaNewsClassification 0.5795 0.9184 0.8001 0.9490
OpusparcusPC 0.9335 0.9662 0.9451 0.9662
PAC 0.6787 0.7168 0.7033 0.7387
PawsXPairClassification 0.5773 0.5999 0.5473 0.7524
PlscClusteringP2P.v2 0.7214 0.7431 0.7161 0.7524
PoemSentimentClassification 0.5886 0.5966 0.5067 0.7522
PolEmo2.0-OUT 0.6277 0.7753 0.3648 0.7881
PpcPC 0.9086 0.9550 0.9218 0.9550
PunjabiNewsClassification 0.8236 0.8261 0.807 0.8522
RTE3 0.8967 0.8955 0.8752 0.9123
Robust04InstructionRetrieval -0.0094 -0.0241 -0.0748 0.1372
RomaniBibleClustering 0.4188 0.4322 0.4092 0.4514
RuBQReranking 0.7126 0.7384 0.756 0.7724
SCIDOCS 0.1843 0.2515 0.1745 0.3453
SIB200ClusteringS2S 0.2653 0.4174 0.2366 0.4719
SICK-R 0.8137 0.8275 0.8023 0.9465
STS12 0.7932 0.8155 0.8002 0.9546
STS13 0.8642 0.8989 0.8155 0.9776
STS14 0.8367 0.8541 0.7772 0.9753
STS15 0.8935 0.9044 0.8931 0.9811
STS17 0.8442 0.8858 0.8214 0.9323
STS22.v2 0.7120 0.7169 0.643 0.7718
STSB 0.8164 0.8550 0.8236 0.9199
STSBenchmark 0.8816 0.8908 0.8729 0.9504
STSES 0.8231 0.8175 0.8021 0.8175
ScalaClassification 0.5077 0.5185 0.5157 0.5743
SemRel24STS 0.6522 0.7314 0.6266 0.8112
SentimentAnalysisHindi 0.6549 0.7606 0.642 0.8001
SinhalaNewsClassification 0.6567 0.8229 0.6682 0.8229
SiswatiNewsClassification 0.5700 0.6238 0.535 0.7837
SlovakMovieReviewSentimentClassification 0.7327 0.9035 0.7441 0.9441
SpartQA 0.1068 0.1030 0.0565 0.3024
SprintDuplicateQuestions 0.9703 0.9690 0.9314 0.9838
StackExchangeClustering.v2 0.9094 0.9207 0.4643 0.9207
StackExchangeClusteringP2P.v2 0.4890 0.5091 0.3854 0.5510
StackOverflowQA 0.8647 0.9671 0.8889 0.9717
StatcanDialogueDatasetRetrieval 0.4627 0.5111 0.1063 0.5807
SummEvalSummarization.v2 0.3764 0.3828 0.3141 0.3893
SwahiliNewsClassification 0.6595 0.6605 0.5969 0.6753
SwednClusteringP2P 0.4004 0.4584 0.3691 0.6213
SwissJudgementClassification 0.5774 0.5786 0.5362 0.6727
SyntheticText2SQL 0.5842 0.6996 0.5307 0.7875
T2Reranking 0.6754 0.6795 0.6632 0.7283
TERRa 0.6515 0.6392 0.5842 0.7133
TRECCOVID 0.8035 0.8631 0.7115 0.9499
Tatoeba 0.5135 0.8197 0.7573 0.9515
TempReasonL1 0.0100 0.0296 0.0114 0.0716
Touche2020Retrieval.v3 0.5890 0.5239 0.4959 0.7465
ToxicConversationsClassification 0.8293 0.8875 0.6601 0.9759
TswanaNewsClassification 0.3115 0.5337 0.47 0.5337
TweetSentimentExtractionClassification 0.6659 0.6988 0.628 0.8823
TweetTopicSingleClassification 0.7302 0.7111 0.6532 0.8171
TwentyNewsgroupsClustering.v2 0.5129 0.5737 0.3921 0.8758
TwitterHjerneRetrieval 0.7204 0.9802 0.3522 0.9802
TwitterSemEval2015 0.7794 0.7917 0.7528 0.8946
TwitterURLCorpus 0.8690 0.8705 0.8583 0.9571
VoyageMMarcoReranking 0.6099 0.6673 0.6821 0.7126
WebLINXCandidatesReranking 0.1016 0.1097 0.0778 0.1595
WikiCitiesClustering 0.9202 0.9163 0.755 0.9381
WikiClusteringP2P.v2 0.2703 0.2823 0.256 0.3234
WikipediaRerankingMultilingual 0.8988 0.9224 0.8932 0.9224
WikipediaRetrievalMultilingual 0.9000 0.9420 0.904 0.9420
WinoGrande 0.5940 0.6052 0.5498 0.7561
XNLI 0.8173 0.8526 0.7477 0.8907
indonli 0.6095 0.6069 0.5174 0.6683
Average 0.6169 0.6863 0.5822 0.7615

@RyanMullins
Contributor Author

@Samoed there seems to be something going on with one test, where it thinks the JSON files are directories that should contain a model_meta.json file. Any idea what I should do to resolve this?

Also, do I need to do anything here to support the "gemma" license like we did in embeddings-benchmark/mteb#3129?

@Samoed (Member) left a comment


You need to put your results under a revision folder. In your case this should be results/google__embeddinggemma-300m/64614b0b8b64f0c6c1e52b07e4e9a4e8fe4d2da2/...
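A sketch of the required restructuring, assuming the result JSONs currently sit at the top level of the model folder (the revision hash is the one quoted above):

```shell
# Create the revision-named folder; the layout is
#   results/<org>__<model>/<revision>/<task>.json
REV=64614b0b8b64f0c6c1e52b07e4e9a4e8fe4d2da2
mkdir -p "results/google__embeddinggemma-300m/${REV}"
# Then move the existing per-task *.json results into that folder.
```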

@RyanMullins
Contributor Author

Thanks, @Samoed! Updated the directories and re-running CI.

@RyanMullins
Copy link
Contributor Author

@Samoed now I'm really confused. These really look like the same path to me (and to Chrome's Find on Page function).
[Screenshot: 2025-09-04 at 12:01:54 PM]

@Samoed
Member

Samoed commented Sep 4, 2025

Me too. I'll take a look.

@RyanMullins
Contributor Author

@Samoed got it. The paths are case-sensitive: the difference between "M" and "m" matters.
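For anyone hitting the same thing: result paths are matched case-sensitively, so folder names that differ only in letter case are different paths. A trivial illustration (the "300M" variant below is a hypothetical misspelling, not a real folder in this repo):

```python
# Case-sensitive comparison: on CI these would be two different directories.
a = "results/google__embeddinggemma-300M"  # hypothetical typo with capital "M"
b = "results/google__embeddinggemma-300m"  # the actual folder name
print(a == b)                  # False
print(a.lower() == b.lower())  # True -- they differ only in case
```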

@Samoed
Member

Samoed commented Sep 4, 2025

Do you have any idea why the scores on STSES and News21InstructionRetrieval are so high?

@RyanMullins
Contributor Author

Let me check with the team.

@RyanMullins
Contributor Author

Bringing in @schechterh, who ran the evals for the model. We're investigating these scores now.

@schechterh

Found the eval jobs these two numbers came from and can confirm those are the results we got.

If it helps, they were run on four TPUv4 chips, on JAX, with parameters loaded as bfloat16.
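As a rough illustration (not the actual eval stack): bfloat16 keeps only 8 bits of significand precision, so individual weights can shift by up to roughly 0.4% relative to float32 when loaded. A minimal sketch of the conversion, assuming simple truncation for clarity (real hardware typically rounds to nearest even):

```python
import struct

def to_bfloat16(x: float) -> float:
    # Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding
    # (truncation; actual hardware usually rounds to nearest even).
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bfloat16(0.1))  # 0.099609375 -- about 0.4% below the float32 value
print(to_bfloat16(1.0))  # 1.0 -- powers of two are exactly representable
```

Per-weight perturbations of this size are normal for bfloat16 inference and would not by themselves explain a large score anomaly on a single task.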

@orionw
Contributor

orionw commented Sep 4, 2025

FWIW, the News21 results are very strong but within a reasonable range, @Samoed (around or better than the Promptriever numbers). I think it mainly means they added some instructions in training, but of course I know they can't confirm nor deny that ;)

Congrats @schechterh and @RyanMullins on the awesome release!

@KennethEnevoldsen (Contributor) left a comment


Thanks for the review @Samoed, and thanks @RyanMullins and @schechterh for double-checking the scores.

I think we can reasonably merge this, congratulations on the release!

@KennethEnevoldsen KennethEnevoldsen merged commit d85b568 into embeddings-benchmark:main Sep 4, 2025
3 checks passed
@ShreyGanatra

Hi @RyanMullins @schechterh, I am running the model for MTEB eval, trying to reproduce the results.

import mteb
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "google/embeddinggemma-300m"
model = mteb.get_model(model_name)
# model = SentenceTransformer(model_name)

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=f"results_{model_name}_mteb", verbosity=1, overwrite_results=True)

I am getting ndcg_at_10: 0.30593 vs the reported 0.71535. Can you point out the error/bug? It would be very helpful.
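The size of the gap is itself informative: a drop of well over half on a single retrieval task usually points to a configuration difference (for example, missing task prompts or a different checkpoint revision) rather than numerical noise. Quantifying it:

```python
# Compare the reported vs reproduced ArguAna ndcg_at_10 scores.
reported, reproduced = 0.71535, 0.30593
relative_drop = (reported - reproduced) / reported
print(f"Relative drop: {relative_drop:.1%}")  # Relative drop: 57.2%
```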

@Samoed
Member

Samoed commented Sep 17, 2025

Hm, I also get 0.3061.
